Can we use attributes and metrics of rock songs to predict their likelihood of winning a Grammy?

What common traits, if any, do award-winning rock songs share? Can we combine intrinsic traits of songs with metrics defined by Spotify to identify award-winning musical features? In this paper, we describe the collection, processing, and analysis of a dataset of roughly 1,000 popular rock songs, both award-winning and not.

The Data

Data Collection and Selection

JOAN TO DO: Write how we chose the 1,000 songs originally; how we ended up with 867; etc.

# Set the working directory to this file's folder
library("rstudioapi")
setwd(dirname(getActiveDocumentContext()$path))
load("final_df_n_str.RData")

Sys.setenv(LANG = "en") 

# Load necessary libraries
library(pROC)
library(MASS)
library(ROSE)
library(confintr)
library(ggplot2)
library(correlation)
library(corrplot)
library(class)
library(caret)
library(glmnet)
# Selecting the relevant variables
data = final_df_n_str
data = data[,c("track_name", "artist_name", "IsWinner", "Year","year",
               "followers", "acousticness", "danceability", "duration_ms",
               "energy", "instrumentalness", "key", "liveness", "loudness",
               "mode", "tempo", "time_signature", "valence")]

# Merge the two year variables
data$Year[data$Year == "Undefined"] <- data$year[data$Year == "Undefined"]
data = data[,c("track_name","artist_name", "IsWinner", "Year", "followers",
               "acousticness", "danceability", "duration_ms",
               "energy", "instrumentalness", "key", "liveness", "loudness",
               "mode", "tempo", "time_signature", "valence")]

# Eliminating duplicates: locate the repeated tracks, inspect them, then drop them
which(data$track_name == "Closing Time")
which(data$track_name == "Smells Like Teen Spirit")
which(data$track_name == "Don't Wanna Fight")
data[c(669, 789, 914), ]

data = data[-c(669, 789, 914),]

# Keeping only songs from 1992 onward
sum(data$Year < 1992)
nrow(data)
data = data[data$Year >= 1992,]

# Creating song labels (track - artist)

names = paste0(data$track_name, " - ", data$artist_name)

# Eliminating unusable variables
data = data[,c("IsWinner", "Year", "followers", "acousticness",
               "danceability", "duration_ms", "energy",
               "instrumentalness", "key", "liveness", "loudness", "mode",
               "tempo", "time_signature", "valence")]
data = cbind(names = names, data)

# Casting variables
data$IsWinner[data$IsWinner == "Winner"] = 1
data$IsWinner[data$IsWinner == "Nominee"] = 1
data$IsWinner[data$IsWinner == "Nothing"] = 0
data$IsWinner = as.integer(data$IsWinner)
data$Year = as.integer(data$Year)
data$mode = as.factor(data$mode)
data$key = as.factor(data$key)
data$time_signature = as.factor(data$time_signature)

# Summary of the processed data
summary(data)
summary(data$IsWinner)

Explanation of Variables

To analyze the songs, we used metrics that are intrinsic to music as well as artificial metrics created and measured by the music streaming giant Spotify. The intrinsic metrics were: duration, musical key, modality (major or minor key), tempo, and time signature. Spotify also publishes what it calls “audio features” (defined below), which it uses in its own analysis of songs when creating playlists, suggesting music, and so on. We used these professionally engineered metrics to complement the intrinsic ones and deepen our insight into what might make a song award-winning.

Audio Feature Definitions

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Although Spotify does not openly share how they determine these metrics, we found them suitable to assist in our analysis.

As a final processing step, we split the data into training and test sets. The training set contains 80% of the data; the remaining 20% is held out to test our model after it has been trained. Evaluating on never-before-seen data shows not only how well the model performs, but also how well it generalizes.

# Splitting training and test set
training_size = floor(0.8 * nrow(data))
set.seed(42)
train_ind = sample(seq_len(nrow(data)), size = training_size)
training_set = data[train_ind,]
test_set = data[-train_ind,]

summary(training_set)
##     names              IsWinner           Year        followers       
##  Length:693         Min.   :0.0000   Min.   :1992   Min.   :    2597  
##  Class :character   1st Qu.:0.0000   1st Qu.:2001   1st Qu.:  868777  
##  Mode  :character   Median :0.0000   Median :2010   Median : 2350118  
##                     Mean   :0.1876   Mean   :2009   Mean   : 4338356  
##                     3rd Qu.:0.0000   3rd Qu.:2018   3rd Qu.: 5615666  
##                     Max.   :1.0000   Max.   :2023   Max.   :44692754  
##                                                                       
##   acousticness        danceability    duration_ms          energy      
##  Min.   :0.0000032   Min.   :0.130   Min.   :  78591   Min.   :0.0975  
##  1st Qu.:0.0016900   1st Qu.:0.419   1st Qu.: 206413   1st Qu.:0.6040  
##  Median :0.0278000   Median :0.522   Median : 237800   Median :0.7570  
##  Mean   :0.1553733   Mean   :0.512   Mean   : 251635   Mean   :0.7182  
##  3rd Qu.:0.2050000   3rd Qu.:0.607   3rd Qu.: 278267   3rd Qu.:0.8820  
##  Max.   :0.9880000   Max.   :0.894   Max.   :1355938   Max.   :0.9960  
##                                                                        
##  instrumentalness        key         liveness         loudness       mode   
##  Min.   :0.00e+00   9      : 95   Min.   :0.0157   Min.   :-18.148   0:203  
##  1st Qu.:4.90e-06   2      : 94   1st Qu.:0.0989   1st Qu.: -8.086   1:490  
##  Median :3.21e-04   7      : 84   Median :0.1240   Median : -6.253          
##  Mean   :6.25e-02   0      : 81   Mean   :0.2004   Mean   : -6.645          
##  3rd Qu.:1.49e-02   11     : 68   3rd Qu.:0.2320   3rd Qu.: -4.767          
##  Max.   :8.95e-01   4      : 58   Max.   :0.9980   Max.   : -1.574          
##                     (Other):213                                             
##      tempo        time_signature    valence      
##  Min.   : 48.58   1:  2          Min.   :0.0494  
##  1st Qu.: 99.19   3: 37          1st Qu.:0.3050  
##  Median :121.14   4:649          Median :0.4640  
##  Mean   :123.28   5:  5          Mean   :0.4725  
##  3rd Qu.:141.93                  3rd Qu.:0.6310  
##  Max.   :205.85                  Max.   :0.9730  
## 
# Checking if the ratio is preserved
sum(data$IsWinner == 1)/ sum(data$IsWinner == 0)
## [1] 0.2159888
sum(training_set$IsWinner == 1)/ sum(training_set$IsWinner == 0)
## [1] 0.2309059
training_set
## # A tibble: 693 × 16
##    names   IsWinner  Year followers acousticness danceability duration_ms energy
##    <chr>      <int> <int>     <int>        <dbl>        <dbl>       <int>  <dbl>
##  1 Nightm…        0  2010   6262809     0.000318        0.554      374453  0.949
##  2 I'd Do…        0  1993   1034322     0.465           0.366      718600  0.561
##  3 Patien…        1  2022   4802169     0.000195        0.318      441402  0.87 
##  4 Someda…        1  2006   6137375     0.254           0.533      295560  0.59 
##  5 I Know…        0  2020   1775452     0.33            0.323      344693  0.323
##  6 Find M…        1  2021   4416749     0.256           0.873      293849  0.809
##  7 Weak -…        0  2017   2923531     0.118           0.67       201159  0.643
##  8 Walk O…        1  2001  11148674     0.00379         0.528      296240  0.832
##  9 Black …        1  2018   2341237     0.197           0.558      259893  0.902
## 10 Spectr…        0  2012   6399322     0.00225         0.578      218190  0.946
## # ℹ 683 more rows
## # ℹ 8 more variables: instrumentalness <dbl>, key <fct>, liveness <dbl>,
## #   loudness <dbl>, mode <fct>, tempo <dbl>, time_signature <fct>,
## #   valence <dbl>

Exploratory Data Analysis

Relationship Between Independent Variables

First, we examined the relationships among the continuous variables.

attach(training_set)

# Correlations between continuous variables
cor_matrix = cor(training_set[,c(-1, -2, -10, -13, -15)])
corrplot(cor_matrix)

pairs(training_set[,c(-1, -2, -10, -13, -15)], lower.panel = panel.smooth)

WHY CAN’T I GET THIS PAIRS() PDF TO INSERT?? Maybe it is inserted, but too big??

knitr::include_graphics("yourPlot.pdf", error = FALSE)

To measure association between the categorical variables, we used Cramér’s V, a normalized version of the chi-squared statistic that ranges from 0 (no association) to 1 (perfect association), making it comparable across tables of different sizes.

# Association measure for categorical variables (Cramer's V is a normalized 
# version of the chi-square statistics)
cramersv(matrix(c(as.numeric(key), as.numeric(mode)), ncol = 2))
## [1] 0.3275984
cramersv(matrix(c(as.numeric(key), as.numeric(time_signature)), ncol = 2))
## [1] 0.305952
cramersv(matrix(c(as.numeric(mode), as.numeric(time_signature)), ncol = 2))
## [1] 0.1425218
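As a sanity check on these values, Cramér’s V can be computed by hand from the chi-squared statistic: V = sqrt(X² / (n · min(r − 1, c − 1))) for an r × c contingency table with n observations. A minimal base-R sketch on a toy table of counts follows; note that confintr::cramersv also accepts the result of chisq.test directly, so when passing raw data it is worth double-checking that the input is interpreted as observations rather than as a table of counts.

```r
# Toy 2x3 contingency table of counts
tab <- matrix(c(20, 30, 10,
                15, 10, 15), nrow = 2, byrow = TRUE)

# Chi-squared statistic of independence
chi2 <- unname(chisq.test(tab)$statistic)

# Cramer's V: normalize by n * min(rows - 1, cols - 1)
n <- sum(tab)
v <- sqrt(chi2 / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
v
```

The normalization keeps V in [0, 1] regardless of table size, which is what makes it usable as a comparable effect-size measure across our three variable pairs.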

Next, we looked for associations between each of the categorical variables (key, mode, and time signature) and all of the continuous variables using one-way ANOVA. Some of these were significant, meaning that the mean of the continuous variable differs across levels of the categorical variable; in other words, the predictors are not mutually independent.

# Association between continuous and categorical variables

# Key
fol_key.aov <- aov(followers ~ key)
summary(fol_key.aov) # SIGNIFICANT
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## key          11 9.557e+14 8.688e+13   2.129 0.0167 *
## Residuals   681 2.780e+16 4.082e+13                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aco_key.aov <- aov(acousticness ~ key)
summary(aco_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## key          11   0.76 0.06888   1.154  0.316
## Residuals   681  40.65 0.05969
dan_key.aov <- aov(danceability ~ key)
summary(dan_key.aov) # SIGNIFICANT
##              Df Sum Sq Mean Sq F value Pr(>F)  
## key          11  0.368 0.03343   1.911 0.0351 *
## Residuals   681 11.913 0.01749                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dur_key.aov <- aov(duration_ms ~ key)
summary(dur_key.aov)
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## key          11 1.302e+11 1.184e+10   1.671 0.0758 .
## Residuals   681 4.825e+12 7.086e+09                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ene_key.aov <- aov(energy ~ key)
summary(ene_key.aov) # SIGNIFICANT
##              Df Sum Sq Mean Sq F value Pr(>F)  
## key          11  0.806 0.07330   1.926 0.0334 *
## Residuals   681 25.922 0.03806                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ins_key.aov <- aov(instrumentalness ~ key)
summary(ins_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## key          11  0.503 0.04574   1.696 0.0701 .
## Residuals   681 18.362 0.02696                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
liv_key.aov <- aov(liveness ~ key)
summary(liv_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## key          11  0.437 0.03974   1.184  0.294
## Residuals   681 22.859 0.03357
loud_key.aov <- aov(loudness ~ key)
summary(loud_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## key          11    109   9.871   1.467  0.139
## Residuals   681   4583   6.730
tem_key.aov <- aov(tempo ~ key)
summary(tem_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## key          11  12755  1159.5   1.384  0.176
## Residuals   681 570619   837.9
val_key.aov <- aov(valence ~ key)
summary(val_key.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## key          11   0.77 0.07024   1.486  0.132
## Residuals   681  32.18 0.04725
# Mode
fol_mode.aov <- aov(followers ~ mode)
summary(fol_mode.aov) # SIGNIFICANT
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## mode          1 2.630e+14 2.630e+14    6.38 0.0118 *
## Residuals   691 2.849e+16 4.123e+13                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aco_mode.aov <- aov(acousticness ~ mode)
summary(aco_mode.aov) # SIGNIFICANT
##              Df Sum Sq Mean Sq F value Pr(>F)  
## mode          1   0.32  0.3200   5.382 0.0206 *
## Residuals   691  41.09  0.0595                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dan_mode.aov <- aov(danceability ~ mode)
summary(dan_mode.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## mode          1  0.012 0.01171    0.66  0.417
## Residuals   691 12.269 0.01775
dur_mode.aov <- aov(duration_ms ~ mode)
summary(dur_mode.aov)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## mode          1 9.401e+07 9.401e+07   0.013  0.909
## Residuals   691 4.955e+12 7.171e+09
ene_mode.aov <- aov(energy ~ mode)
summary(ene_mode.aov) # SIGNIFICANT
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## mode          1  0.374  0.3737   9.798 0.00182 **
## Residuals   691 26.354  0.0381                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ins_mode.aov <- aov(instrumentalness ~ mode)
summary(ins_mode.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## mode          1  0.017 0.01729   0.634  0.426
## Residuals   691 18.848 0.02728
liv_mode.aov <- aov(liveness ~ mode)
summary(liv_mode.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## mode          1  0.038 0.03803    1.13  0.288
## Residuals   691 23.258 0.03366
loud_mode.aov <- aov(loudness ~ mode)
summary(loud_mode.aov) # SIGNIFICANT
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## mode          1     59   59.19   8.828 0.00307 **
## Residuals   691   4633    6.70                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tem_mode.aov <- aov(tempo ~ mode)
summary(tem_mode.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## mode          1    928   928.2   1.101  0.294
## Residuals   691 582445   842.9
val_mode.aov <- aov(valence ~ mode)
summary(val_mode.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## mode          1   0.00 0.00049    0.01   0.92
## Residuals   691  32.95 0.04769
# Time signature
fol_time.aov <- aov(followers ~ time_signature)
summary(fol_time.aov)
##                 Df    Sum Sq   Mean Sq F value Pr(>F)
## time_signature   3 1.400e+13 4.667e+12   0.112  0.953
## Residuals      689 2.874e+16 4.171e+13
aco_time.aov <- aov(acousticness ~ time_signature)
summary(aco_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## time_signature   3   1.27  0.4238   7.275 8.28e-05 ***
## Residuals      689  40.14  0.0583                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dan_time.aov <- aov(danceability ~ time_signature)
summary(dan_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value  Pr(>F)   
## time_signature   3  0.239 0.07964   4.557 0.00359 **
## Residuals      689 12.042 0.01748                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dur_time.aov <- aov(duration_ms ~ time_signature)
summary(dur_time.aov) # SIGNIFICANT
##                 Df    Sum Sq   Mean Sq F value   Pr(>F)    
## time_signature   3 1.156e+11 3.855e+10   5.488 0.000993 ***
## Residuals      689 4.840e+12 7.024e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ene_time.aov <- aov(energy ~ time_signature)
summary(ene_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## time_signature   3   0.74 0.24678   6.543 0.000229 ***
## Residuals      689  25.99 0.03772                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ins_time.aov <- aov(instrumentalness ~ time_signature)
summary(ins_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value  Pr(>F)   
## time_signature   3   0.34 0.11337   4.217 0.00573 **
## Residuals      689  18.52 0.02689                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
liv_time.aov <- aov(liveness ~ time_signature)
summary(liv_time.aov)
##                 Df Sum Sq Mean Sq F value Pr(>F)
## time_signature   3  0.021 0.00686   0.203  0.894
## Residuals      689 23.276 0.03378
loud_time.aov <- aov(loudness ~ time_signature)
summary(loud_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value  Pr(>F)   
## time_signature   3     78  25.887   3.865 0.00927 **
## Residuals      689   4614   6.697                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tem_time.aov <- aov(tempo ~ time_signature)
summary(tem_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## time_signature   3   7794  2598.1    3.11 0.0259 *
## Residuals      689 575579   835.4                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
val_time.aov <- aov(valence ~ time_signature)
summary(val_time.aov) # SIGNIFICANT
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## time_signature   3   0.52 0.17279   3.671 0.0121 *
## Residuals      689  32.43 0.04707                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Several of these associations were significant: for example, energy and loudness differ by mode, and most of the audio features differ by time signature. This confirms that the predictors are not mutually independent, so individual coefficients in the models below should be interpreted with this collinearity in mind.

We then computed partial correlations, which measure the linear association between two variables after controlling for all of the others. This helps distinguish direct relationships from associations induced by shared correlates.

# Partial correlations
correlation(training_set[,c(-1, -2, -10, -13, -15)], partial = TRUE)
## # Correlation Matrix (pearson-method)
## 
## Parameter1       |       Parameter2 |         r |         95% CI | t(691) |         p
## -------------------------------------------------------------------------------------
## Year             |        followers |     -0.09 | [-0.16, -0.02] |  -2.37 | 0.580    
## Year             |     acousticness |      0.15 | [ 0.07,  0.22] |   3.94 | 0.004**  
## Year             |     danceability |      0.14 | [ 0.07,  0.21] |   3.77 | 0.007**  
## Year             |      duration_ms |     -0.08 | [-0.15,  0.00] |  -2.07 | > .999   
## Year             |           energy |     -0.03 | [-0.10,  0.05] |  -0.74 | > .999   
## Year             | instrumentalness |      0.13 | [ 0.05,  0.20] |   3.33 | 0.035*   
## Year             |         liveness | -8.88e-03 | [-0.08,  0.07] |  -0.23 | > .999   
## Year             |         loudness |      0.24 | [ 0.17,  0.31] |   6.58 | < .001***
## Year             |            tempo |      0.10 | [ 0.02,  0.17] |   2.58 | 0.340    
## Year             |          valence |     -0.12 | [-0.19, -0.05] |  -3.18 | 0.057    
## followers        |     acousticness |     -0.04 | [-0.11,  0.04] |  -1.03 | > .999   
## followers        |     danceability | -7.23e-03 | [-0.08,  0.07] |  -0.19 | > .999   
## followers        |      duration_ms |      0.03 | [-0.05,  0.10] |   0.74 | > .999   
## followers        |           energy |     -0.06 | [-0.14,  0.01] |  -1.71 | > .999   
## followers        | instrumentalness |     -0.06 | [-0.14,  0.01] |  -1.65 | > .999   
## followers        |         liveness |     -0.06 | [-0.13,  0.02] |  -1.47 | > .999   
## followers        |         loudness |      0.11 | [ 0.04,  0.18] |   2.89 | 0.144    
## followers        |            tempo |      0.01 | [-0.06,  0.09] |   0.33 | > .999   
## followers        |          valence |     -0.09 | [-0.16, -0.01] |  -2.28 | 0.687    
## acousticness     |     danceability |     -0.03 | [-0.10,  0.05] |  -0.72 | > .999   
## acousticness     |      duration_ms |     -0.07 | [-0.15,  0.00] |  -1.92 | > .999   
## acousticness     |           energy |     -0.45 | [-0.51, -0.39] | -13.31 | < .001***
## acousticness     | instrumentalness |      0.03 | [-0.05,  0.10] |   0.67 | > .999   
## acousticness     |         liveness |      0.09 | [ 0.01,  0.16] |   2.35 | 0.593    
## acousticness     |         loudness |     -0.13 | [-0.20, -0.05] |  -3.40 | 0.028*   
## acousticness     |            tempo |      0.04 | [-0.04,  0.11] |   0.99 | > .999   
## acousticness     |          valence |      0.07 | [-0.01,  0.14] |   1.79 | > .999   
## danceability     |      duration_ms |     -0.10 | [-0.18, -0.03] |  -2.75 | 0.215    
## danceability     |           energy |     -0.15 | [-0.22, -0.08] |  -3.99 | 0.003**  
## danceability     | instrumentalness |      0.04 | [-0.03,  0.11] |   1.06 | > .999   
## danceability     |         liveness |     -0.13 | [-0.20, -0.05] |  -3.36 | 0.032*   
## danceability     |         loudness |     -0.01 | [-0.09,  0.06] |  -0.33 | > .999   
## danceability     |            tempo |     -0.31 | [-0.37, -0.24] |  -8.50 | < .001***
## danceability     |          valence |      0.53 | [ 0.47,  0.58] |  16.23 | < .001***
## duration_ms      |           energy |      0.06 | [-0.02,  0.13] |   1.49 | > .999   
## duration_ms      | instrumentalness |      0.16 | [ 0.09,  0.23] |   4.35 | < .001***
## duration_ms      |         liveness |     -0.04 | [-0.11,  0.04] |  -0.98 | > .999   
## duration_ms      |         loudness |     -0.08 | [-0.16, -0.01] |  -2.22 | 0.771    
## duration_ms      |            tempo |      0.01 | [-0.06,  0.09] |   0.30 | > .999   
## duration_ms      |          valence |     -0.15 | [-0.22, -0.07] |  -3.86 | 0.005**  
## energy           | instrumentalness |      0.16 | [ 0.09,  0.23] |   4.30 | < .001***
## energy           |         liveness |      0.15 | [ 0.08,  0.23] |   4.11 | 0.002**  
## energy           |         loudness |      0.62 | [ 0.57,  0.66] |  20.64 | < .001***
## energy           |            tempo |      0.04 | [-0.03,  0.12] |   1.10 | > .999   
## energy           |          valence |      0.26 | [ 0.18,  0.32] |   6.95 | < .001***
## instrumentalness |         liveness |     -0.06 | [-0.14,  0.01] |  -1.67 | > .999   
## instrumentalness |         loudness |     -0.18 | [-0.25, -0.11] |  -4.85 | < .001***
## instrumentalness |            tempo |      0.06 | [-0.01,  0.14] |   1.70 | > .999   
## instrumentalness |          valence |     -0.09 | [-0.16, -0.02] |  -2.40 | 0.554    
## liveness         |         loudness |     -0.08 | [-0.15,  0.00] |  -2.02 | > .999   
## liveness         |            tempo |     -0.05 | [-0.13,  0.02] |  -1.38 | > .999   
## liveness         |          valence |      0.01 | [-0.06,  0.09] |   0.29 | > .999   
## loudness         |            tempo |     -0.03 | [-0.10,  0.05] |  -0.70 | > .999   
## loudness         |          valence |  7.25e-04 | [-0.07,  0.08] |   0.02 | > .999   
## tempo            |          valence |      0.16 | [ 0.09,  0.23] |   4.33 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 693
# Plots of variables with the largest partial correlation
ggplot(data = training_set, aes(danceability, valence)) + geom_jitter(color = "blue")

ggplot(data = training_set, aes(loudness, energy)) + geom_jitter(color = "blue")

ggplot(data = training_set, aes(acousticness, energy)) + geom_jitter(color = "blue")
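The partial correlations reported by the correlation package have a simple interpretation: each is the ordinary correlation between the residuals of the two variables after regressing out all other covariates. A self-contained sketch on simulated data (the variables x, y, and z are illustrative, not from our dataset):

```r
set.seed(1)
n <- 500
z <- rnorm(n)            # shared driver
x <- z + rnorm(n)        # x and y both depend on z
y <- z + rnorm(n)

# Marginal correlation is inflated by the shared dependence on z
cor(x, y)

# Partial correlation: correlate the residuals after removing z
rx <- resid(lm(x ~ z))
ry <- resid(lm(y ~ z))
cor(rx, ry)              # near zero: no direct association remains
```

In our table, this is why some pairs that look related marginally (through shared drivers such as Year or energy) show negligible partial correlations.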

# Inspecting an unusually long track
which.max(data$duration_ms)
## [1] 448
data[which.max(data$duration_ms), ]
## # A tibble: 1 × 16
##   names    IsWinner  Year followers acousticness danceability duration_ms energy
##   <chr>       <int> <int>     <int>        <dbl>        <dbl>       <int>  <dbl>
## 1 Drops -…        0  2014    886702        0.853        0.703      173627  0.237
## # ℹ 8 more variables: instrumentalness <dbl>, key <fct>, liveness <dbl>,
## #   loudness <dbl>, mode <fct>, tempo <dbl>, time_signature <fct>,
## #   valence <dbl>

The largest partial correlations align with musical intuition: danceable tracks tend to sound more positive (danceability-valence, r = 0.53), louder tracks are perceived as more energetic (loudness-energy, r = 0.62), and more acoustic tracks as less energetic (acousticness-energy, r = -0.45). These relationships hold even after controlling for all other variables, suggesting they are direct rather than induced by shared correlates.

Examining the distributions of the attributes among our dataset:

# Checking distributions

par(mfrow= c(2, 5))
 
# Continuous variables

hist(followers)
hist(acousticness)
hist(danceability)
hist(duration_ms)
hist(energy)
hist(instrumentalness)
hist(liveness)
hist(loudness)
hist(tempo)
hist(valence)

ALSO MAY WANT TO INSERT A PDF HERE; PLOT IS CUT OFF

Looking at categorical variables:

# Categorical variables

par(mfrow = c(1, 3))

barplot(table(key), main = "Key distribution")
barplot(table(mode), main = "Mode")
barplot(table(time_signature), main = "Time signature")

Relationships between the dependent and independent variables:

# Relationships between dependent and independent variables

par(mfrow= c(2, 5))

boxplot(danceability ~ training_set$IsWinner)
boxplot(followers ~ training_set$IsWinner)
boxplot(acousticness ~ training_set$IsWinner)
boxplot(duration_ms ~ training_set$IsWinner)
boxplot(energy ~ training_set$IsWinner)
boxplot(instrumentalness ~ training_set$IsWinner)
boxplot(liveness ~ training_set$IsWinner)
boxplot(loudness ~ training_set$IsWinner)
boxplot(tempo ~ training_set$IsWinner)
boxplot(valence ~ training_set$IsWinner)

We also tested each categorical variable for association with IsWinner, using chi-squared tests, Cramér’s V, and the underlying contingency tables:

par(mfrow = c(1, 1))

chisq.test(key, training_set$IsWinner)
## 
##  Pearson's Chi-squared test
## 
## data:  key and training_set$IsWinner
## X-squared = 10.443, df = 11, p-value = 0.491
chisq.test(mode, training_set$IsWinner)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mode and training_set$IsWinner
## X-squared = 0, df = 1, p-value = 1
chisq.test(time_signature, training_set$IsWinner)
## 
##  Pearson's Chi-squared test
## 
## data:  time_signature and training_set$IsWinner
## X-squared = 3.3331, df = 3, p-value = 0.3431
cramersv(matrix(c(as.numeric(key), as.numeric(training_set$IsWinner)), ncol = 2))
## [1] 0.4116833
cramersv(matrix(c(as.numeric(mode), as.numeric(training_set$IsWinner)), ncol = 2))
## [1] 0.5604862
cramersv(matrix(c(as.numeric(time_signature), as.numeric(training_set$IsWinner)), ncol = 2))
## [1] 0.4527729
table(mode, training_set$IsWinner)
##     
## mode   0   1
##    0 165  38
##    1 398  92
table(time_signature, training_set$IsWinner)
##               
## time_signature   0   1
##              1   2   0
##              3  33   4
##              4 523 126
##              5   5   0

In summary, the exploratory analysis shows only weak marginal relationships between individual features and award status, along with notable collinearity among the predictors. We therefore turn to multivariate models that weigh the features jointly.

Model Fitting

We fit several classification models on both the original and an oversampled version of the training data. Because only about 19% of the training songs are winners or nominees, a classifier can score well simply by always predicting the majority class. Oversampling duplicates minority-class rows until the classes are roughly balanced; comparing against models fit on the unbalanced data tells us whether this actually improves prediction.

## Oversampling

oversampled_train_data = ovun.sample(IsWinner ~ ., data = training_set[,-1],
                                     method = "over", p = 0.5, seed = 42)$data


# Checking oversampled training set balance

sum(oversampled_train_data$IsWinner == 0)
## [1] 563
sum(oversampled_train_data$IsWinner == 1) 
## [1] 538
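ROSE’s ovun.sample does the work here, but the underlying idea is simple: draw minority-class rows with replacement until the two classes are balanced. A base-R sketch of the same idea (the function and column names are illustrative; unlike ovun.sample with p = 0.5, this version balances the classes exactly rather than randomizing the final count):

```r
# Randomly oversample the minority class (label 1) up to the majority count
oversample <- function(df, target = "IsWinner") {
  minority <- df[df[[target]] == 1, ]
  majority <- df[df[[target]] == 0, ]
  # Draw minority rows with replacement to match the majority class size
  extra <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
  rbind(majority, extra)
}

set.seed(42)
toy <- data.frame(IsWinner = c(rep(0, 90), rep(1, 10)), x = rnorm(100))
balanced <- oversample(toy)
table(balanced$IsWinner)  # 90 of each class
```

Note that oversampling is applied only to the training set; the test set keeps its natural class ratio so that evaluation reflects real-world conditions.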

Logistic Model

We begin with a logistic regression model, which expresses the log-odds of a song being a winner or nominee as a linear function of the predictors; each coefficient is the change in log-odds per unit increase in the corresponding feature.

## Simple logistic model

logistic = glm(IsWinner ~ ., data = training_set[,c(-1,-2)], family = "binomial")
summary(logistic)
## 
## Call:
## glm(formula = IsWinner ~ ., family = "binomial", data = training_set[, 
##     c(-1, -2)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4740  -0.6879  -0.5250  -0.2622   2.5034  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       9.965e+00  1.014e+03   0.010  0.99216   
## Year             -1.434e-02  1.162e-02  -1.235  0.21698   
## followers         2.214e-08  1.466e-08   1.510  0.13093   
## acousticness     -2.434e+00  7.947e-01  -3.063  0.00219 **
## danceability     -1.148e+00  1.025e+00  -1.120  0.26256   
## duration_ms       4.133e-06  1.359e-06   3.041  0.00236 **
## energy            9.182e-01  1.045e+00   0.879  0.37943   
## instrumentalness -6.522e-01  7.453e-01  -0.875  0.38152   
## key1              6.540e-02  5.124e-01   0.128  0.89842   
## key2             -5.461e-02  4.016e-01  -0.136  0.89184   
## key3             -3.009e-01  7.241e-01  -0.415  0.67779   
## key4             -7.231e-02  4.818e-01  -0.150  0.88071   
## key5              4.705e-01  4.582e-01   1.027  0.30451   
## key6              2.705e-02  5.314e-01   0.051  0.95941   
## key7             -6.661e-01  4.574e-01  -1.456  0.14531   
## key8             -6.247e-02  6.038e-01  -0.103  0.91760   
## key9             -2.615e-02  4.050e-01  -0.065  0.94852   
## key10            -7.590e-01  6.245e-01  -1.215  0.22424   
## key11             5.312e-01  4.366e-01   1.217  0.22373   
## liveness         -2.863e-01  6.027e-01  -0.475  0.63473   
## loudness         -1.356e-01  7.110e-02  -1.907  0.05653 . 
## mode1             1.086e-01  2.464e-01   0.441  0.65938   
## tempo            -5.357e-04  3.872e-03  -0.138  0.88997   
## time_signature3   1.455e+01  1.014e+03   0.014  0.98855   
## time_signature4   1.497e+01  1.014e+03   0.015  0.98822   
## time_signature5   1.095e+00  1.186e+03   0.001  0.99926   
## valence           1.388e+00  5.994e-01   2.316  0.02058 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 669.04  on 692  degrees of freedom
## Residual deviance: 607.29  on 666  degrees of freedom
## AIC: 661.29
## 
## Number of Fisher Scoring iterations: 14

Three predictors are significant at the 5% level: acousticness (negative), duration (positive), and valence (positive). All else equal, longer and happier-sounding tracks have higher odds of a win or nomination, while more acoustic tracks have lower odds. The enormous standard errors on the intercept and the time-signature dummies are a symptom of quasi-complete separation: the rare 1/4 and 5/4 categories contain no winners at all.
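Logistic coefficients are easier to read as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds per unit increase in a feature. A quick sketch using the fitted coefficients reported above (the 0.1 acousticness step and the 60,000 ms, i.e. one-minute, duration step are chosen purely for illustration):

```r
# Coefficients copied from the model summary above
b_acousticness <- -2.434
b_duration_ms  <-  4.133e-06
b_valence      <-  1.388

# Odds ratio for a 0.1 increase in acousticness
exp(0.1 * b_acousticness)   # about 0.78: roughly 22% lower odds

# Odds ratio for one extra minute of duration
exp(60000 * b_duration_ms)  # about 1.28: roughly 28% higher odds

# Odds ratio for a 0.1 increase in valence
exp(0.1 * b_valence)        # about 1.15: roughly 15% higher odds
```

These effect sizes are modest, consistent with the small improvement in deviance over the null model.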

Stepwise Variable Selection

Next, we applied backward stepwise selection with stepAIC, which iteratively removes the predictor whose deletion most improves AIC:

# Stepwise variable selection

log_back = stepAIC(logistic, direction = "backward")
## Start:  AIC=661.29
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + key + liveness + loudness + mode + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - key              11   618.34 650.34
## - time_signature    3   611.14 659.14
## - tempo             1   607.31 659.31
## - mode              1   607.49 659.49
## - liveness          1   607.52 659.52
## - energy            1   608.07 660.07
## - instrumentalness  1   608.11 660.11
## - danceability      1   608.55 660.55
## - Year              1   608.82 660.82
## <none>                  607.29 661.29
## - followers         1   609.45 661.45
## - loudness          1   610.93 662.93
## - valence           1   612.74 664.74
## - duration_ms       1   617.69 669.69
## - acousticness      1   618.15 670.15
## 
## Step:  AIC=650.34
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + mode + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - time_signature    3   621.85 647.85
## - mode              1   618.35 648.35
## - tempo             1   618.38 648.38
## - liveness          1   618.62 648.62
## - instrumentalness  1   618.90 648.90
## - Year              1   619.26 649.26
## - energy            1   619.29 649.29
## - danceability      1   619.31 649.31
## <none>                  618.34 650.34
## - loudness          1   621.68 651.68
## - followers         1   622.03 652.03
## - valence           1   624.85 654.85
## - acousticness      1   628.14 658.14
## - duration_ms       1   628.95 658.95
## 
## Step:  AIC=647.85
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + mode + 
##     tempo + valence
## 
##                    Df Deviance    AIC
## - mode              1   621.87 645.87
## - tempo             1   621.89 645.89
## - liveness          1   622.08 646.08
## - danceability      1   622.64 646.64
## - Year              1   622.79 646.79
## - instrumentalness  1   623.00 647.00
## - energy            1   623.10 647.10
## <none>                  621.85 647.85
## - loudness          1   625.48 649.48
## - followers         1   625.76 649.76
## - valence           1   628.38 652.38
## - acousticness      1   631.90 655.90
## - duration_ms       1   631.95 655.95
## 
## Step:  AIC=645.87
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + tempo + 
##     valence
## 
##                    Df Deviance    AIC
## - tempo             1   621.92 643.92
## - liveness          1   622.11 644.11
## - danceability      1   622.69 644.69
## - Year              1   622.87 644.87
## - instrumentalness  1   623.05 645.05
## - energy            1   623.11 645.11
## <none>                  621.87 645.87
## - loudness          1   625.50 647.50
## - followers         1   625.76 647.76
## - valence           1   628.46 650.46
## - acousticness      1   631.91 653.91
## - duration_ms       1   631.98 653.98
## 
## Step:  AIC=643.92
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + valence
## 
##                    Df Deviance    AIC
## - liveness          1   622.17 642.17
## - Year              1   622.89 642.89
## - danceability      1   622.99 642.99
## - instrumentalness  1   623.08 643.08
## - energy            1   623.19 643.19
## <none>                  621.92 643.92
## - loudness          1   625.56 645.56
## - followers         1   625.81 645.81
## - valence           1   628.86 648.86
## - acousticness      1   631.92 651.92
## - duration_ms       1   632.03 652.03
## 
## Step:  AIC=642.17
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + loudness + valence
## 
##                    Df Deviance    AIC
## - danceability      1   623.11 641.11
## - Year              1   623.14 641.14
## - instrumentalness  1   623.26 641.26
## - energy            1   623.28 641.28
## <none>                  622.17 642.17
## - loudness          1   625.63 643.63
## - followers         1   626.16 644.16
## - valence           1   629.05 647.05
## - duration_ms       1   632.39 650.39
## - acousticness      1   632.42 650.42
## 
## Step:  AIC=641.11
## IsWinner ~ Year + followers + acousticness + duration_ms + energy + 
##     instrumentalness + loudness + valence
## 
##                    Df Deviance    AIC
## - instrumentalness  1   624.17 640.17
## - Year              1   624.36 640.36
## - energy            1   624.65 640.65
## <none>                  623.11 641.11
## - loudness          1   626.44 642.44
## - followers         1   627.15 643.15
## - valence           1   629.18 645.18
## - acousticness      1   633.25 649.25
## - duration_ms       1   634.22 650.22
## 
## Step:  AIC=640.17
## IsWinner ~ Year + followers + acousticness + duration_ms + energy + 
##     loudness + valence
## 
##                Df Deviance    AIC
## - energy        1   625.38 639.38
## - Year          1   625.84 639.84
## <none>              624.17 640.17
## - loudness      1   626.90 640.90
## - followers     1   628.51 642.51
## - valence       1   630.69 644.69
## - acousticness  1   634.17 648.17
## - duration_ms   1   634.32 648.32
## 
## Step:  AIC=639.38
## IsWinner ~ Year + followers + acousticness + duration_ms + loudness + 
##     valence
## 
##                Df Deviance    AIC
## - loudness      1   626.91 638.91
## - Year          1   627.23 639.23
## <none>              625.38 639.38
## - followers     1   629.35 641.35
## - valence       1   633.24 645.24
## - duration_ms   1   636.50 648.50
## - acousticness  1   640.70 652.70
## 
## Step:  AIC=638.91
## IsWinner ~ Year + followers + acousticness + duration_ms + valence
## 
##                Df Deviance    AIC
## <none>              626.91 638.91
## - Year          1   630.12 640.12
## - followers     1   630.46 640.46
## - valence       1   633.88 643.88
## - duration_ms   1   638.84 648.84
## - acousticness  1   641.76 651.76
log_for = stepAIC(logistic, direction = "forward")
## Start:  AIC=661.29
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + key + liveness + loudness + mode + 
##     tempo + time_signature + valence
log_both =  stepAIC(logistic, direction = "both")
## Start:  AIC=661.29
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + key + liveness + loudness + mode + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - key              11   618.34 650.34
## - time_signature    3   611.14 659.14
## - tempo             1   607.31 659.31
## - mode              1   607.49 659.49
## - liveness          1   607.52 659.52
## - energy            1   608.07 660.07
## - instrumentalness  1   608.11 660.11
## - danceability      1   608.55 660.55
## - Year              1   608.82 660.82
## <none>                  607.29 661.29
## - followers         1   609.45 661.45
## - loudness          1   610.93 662.93
## - valence           1   612.74 664.74
## - duration_ms       1   617.69 669.69
## - acousticness      1   618.15 670.15
## 
## Step:  AIC=650.34
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + mode + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - time_signature    3   621.85 647.85
## - mode              1   618.35 648.35
## - tempo             1   618.38 648.38
## - liveness          1   618.62 648.62
## - instrumentalness  1   618.90 648.90
## - Year              1   619.26 649.26
## - energy            1   619.29 649.29
## - danceability      1   619.31 649.31
## <none>                  618.34 650.34
## - loudness          1   621.68 651.68
## - followers         1   622.03 652.03
## - valence           1   624.85 654.85
## - acousticness      1   628.14 658.14
## - duration_ms       1   628.95 658.95
## + key              11   607.29 661.29
## 
## Step:  AIC=647.85
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + mode + 
##     tempo + valence
## 
##                    Df Deviance    AIC
## - mode              1   621.87 645.87
## - tempo             1   621.89 645.89
## - liveness          1   622.08 646.08
## - danceability      1   622.64 646.64
## - Year              1   622.79 646.79
## - instrumentalness  1   623.00 647.00
## - energy            1   623.10 647.10
## <none>                  621.85 647.85
## - loudness          1   625.48 649.48
## - followers         1   625.76 649.76
## + time_signature    3   618.34 650.34
## - valence           1   628.38 652.38
## - acousticness      1   631.90 655.90
## - duration_ms       1   631.95 655.95
## + key              11   611.14 659.14
## 
## Step:  AIC=645.87
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + tempo + 
##     valence
## 
##                    Df Deviance    AIC
## - tempo             1   621.92 643.92
## - liveness          1   622.11 644.11
## - danceability      1   622.69 644.69
## - Year              1   622.87 644.87
## - instrumentalness  1   623.05 645.05
## - energy            1   623.11 645.11
## <none>                  621.87 645.87
## - loudness          1   625.50 647.50
## - followers         1   625.76 647.76
## + mode              1   621.85 647.85
## + time_signature    3   618.35 648.35
## - valence           1   628.46 650.46
## - acousticness      1   631.91 653.91
## - duration_ms       1   631.98 653.98
## + key              11   611.41 657.41
## 
## Step:  AIC=643.92
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + liveness + loudness + valence
## 
##                    Df Deviance    AIC
## - liveness          1   622.17 642.17
## - Year              1   622.89 642.89
## - danceability      1   622.99 642.99
## - instrumentalness  1   623.08 643.08
## - energy            1   623.19 643.19
## <none>                  621.92 643.92
## - loudness          1   625.56 645.56
## - followers         1   625.81 645.81
## + tempo             1   621.87 645.87
## + mode              1   621.89 645.89
## + time_signature    3   618.39 646.39
## - valence           1   628.86 648.86
## - acousticness      1   631.92 651.92
## - duration_ms       1   632.03 652.03
## + key              11   611.42 655.42
## 
## Step:  AIC=642.17
## IsWinner ~ Year + followers + acousticness + danceability + duration_ms + 
##     energy + instrumentalness + loudness + valence
## 
##                    Df Deviance    AIC
## - danceability      1   623.11 641.11
## - Year              1   623.14 641.14
## - instrumentalness  1   623.26 641.26
## - energy            1   623.28 641.28
## <none>                  622.17 642.17
## - loudness          1   625.63 643.63
## + liveness          1   621.92 643.92
## + tempo             1   622.11 644.11
## + mode              1   622.13 644.13
## - followers         1   626.16 644.16
## + time_signature    3   618.68 644.68
## - valence           1   629.05 647.05
## - duration_ms       1   632.39 650.39
## - acousticness      1   632.42 650.42
## + key              11   611.64 653.64
## 
## Step:  AIC=641.11
## IsWinner ~ Year + followers + acousticness + duration_ms + energy + 
##     instrumentalness + loudness + valence
## 
##                    Df Deviance    AIC
## - instrumentalness  1   624.17 640.17
## - Year              1   624.36 640.36
## - energy            1   624.65 640.65
## <none>                  623.11 641.11
## + danceability      1   622.17 642.17
## - loudness          1   626.44 642.44
## + tempo             1   622.82 642.82
## + liveness          1   622.99 642.99
## + mode              1   623.03 643.03
## - followers         1   627.15 643.15
## + time_signature    3   619.78 643.78
## - valence           1   629.18 645.18
## - acousticness      1   633.25 649.25
## - duration_ms       1   634.22 650.22
## + key              11   612.71 652.71
## 
## Step:  AIC=640.17
## IsWinner ~ Year + followers + acousticness + duration_ms + energy + 
##     loudness + valence
## 
##                    Df Deviance    AIC
## - energy            1   625.38 639.38
## - Year              1   625.84 639.84
## <none>                  624.17 640.17
## - loudness          1   626.90 640.90
## + instrumentalness  1   623.11 641.11
## + danceability      1   623.26 641.26
## + tempo             1   623.93 641.93
## + mode              1   624.06 642.06
## + liveness          1   624.08 642.08
## + time_signature    3   620.25 642.25
## - followers         1   628.51 642.51
## - valence           1   630.69 644.69
## - acousticness      1   634.17 648.17
## - duration_ms       1   634.32 648.32
## + key              11   614.20 652.20
## 
## Step:  AIC=639.38
## IsWinner ~ Year + followers + acousticness + duration_ms + loudness + 
##     valence
## 
##                    Df Deviance    AIC
## - loudness          1   626.91 638.91
## - Year              1   627.23 639.23
## <none>                  625.38 639.38
## + danceability      1   624.10 640.10
## + energy            1   624.17 640.17
## + instrumentalness  1   624.65 640.65
## + tempo             1   625.01 641.01
## + time_signature    3   621.28 641.28
## + mode              1   625.33 641.33
## - followers         1   629.35 641.35
## + liveness          1   625.38 641.38
## - valence           1   633.24 645.24
## - duration_ms       1   636.50 648.50
## + key              11   615.09 651.09
## - acousticness      1   640.70 652.70
## 
## Step:  AIC=638.91
## IsWinner ~ Year + followers + acousticness + duration_ms + valence
## 
##                    Df Deviance    AIC
## <none>                  626.91 638.91
## + loudness          1   625.38 639.38
## - Year              1   630.12 640.12
## + danceability      1   626.13 640.13
## + instrumentalness  1   626.45 640.45
## - followers         1   630.46 640.46
## + tempo             1   626.64 640.64
## + mode              1   626.85 640.85
## + liveness          1   626.90 640.90
## + energy            1   626.90 640.90
## + time_signature    3   622.92 640.92
## - valence           1   633.88 643.88
## - duration_ms       1   638.84 648.84
## + key              11   617.06 651.06
## - acousticness      1   641.76 651.76

Backward elimination and bidirectional ("both") selection converge to the same final model, IsWinner ~ Year + followers + acousticness + duration_ms + valence, with AIC = 638.91 versus 661.29 for the full model. Forward selection stops immediately: since it starts from the full model, there are no terms left to add.
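Rather than scrolling through the selection trace, the chosen model can be read off the stepAIC() return value with formula() and AIC(). A self-contained toy run (synthetic data, not the paper's) illustrates this:

```r
library(MASS)

set.seed(1)
# Toy data: y depends on x1 only; x2 is pure noise
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = rbinom(200, 1, plogis(1.5 * d$x1))

full = glm(y ~ x1 + x2, data = d, family = "binomial")
sel = stepAIC(full, direction = "backward", trace = FALSE)

# The return value is an ordinary glm fit for the selected model
formula(sel)
AIC(sel)
```

Applied to the fits above, formula(log_back) likewise returns the selected five-variable model and AIC(log_back) its AIC of 638.91.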

Reduced Model

Guided by the stepwise results, we fit a reduced model. It keeps followers, acousticness, duration_ms, and valence from the final stepwise model, drops Year, and adds back loudness (the last variable eliminated) and danceability:

# Fitting the reduced model

logistic_reduced = glm(IsWinner ~ danceability + loudness + followers + valence + duration_ms + acousticness, data = training_set,  family = "binomial")

summary(logistic_reduced)
## 
## Call:
## glm(formula = IsWinner ~ danceability + loudness + followers + 
##     valence + duration_ms + acousticness, family = "binomial", 
##     data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4000  -0.6855  -0.5664  -0.2906   2.4843  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.152e+00  6.570e-01  -4.798  1.6e-06 ***
## danceability -1.180e+00  8.977e-01  -1.314 0.188805    
## loudness     -9.536e-02  5.062e-02  -1.884 0.059591 .  
## followers     2.991e-08  1.377e-08   2.172 0.029850 *  
## valence       1.759e+00  5.564e-01   3.162 0.001568 ** 
## duration_ms   3.767e-06  1.258e-06   2.995 0.002744 ** 
## acousticness -2.635e+00  7.230e-01  -3.644 0.000268 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 669.04  on 692  degrees of freedom
## Residual deviance: 625.51  on 686  degrees of freedom
## AIC: 639.51
## 
## Number of Fisher Scoring iterations: 5

In the reduced model, acousticness has a strongly negative effect (p < 0.001), while valence, duration_ms, and followers are significantly positive; loudness is marginal (p ≈ 0.06) and danceability is not significant. At 639.51, the AIC is essentially the same as the stepwise optimum (638.91), suggesting these few features carry most of the signal.

Predicting

Making the predictions:

# Computing predictions

logistic_predictions = predict(logistic_reduced, newdata = test_set[,c(-1, -2)], type = "response")

We inspect the predictions under three classification thresholds: 0.2, 0.3, and 0.4. The threshold is the cutoff above which a predicted probability is labeled a win. Because only about 14% of the test songs are winners, the default cutoff of 0.5 would label almost nothing a win; lowering the threshold trades a higher false-positive (type I error) rate for higher sensitivity.
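The three chunks below repeat the same confusion-matrix arithmetic; as a hypothetical helper (not part of the original analysis), that logic can be wrapped in a single function:

```r
# Hypothetical helper: confusion-matrix metrics at a given threshold.
# Assumes `probs` are predicted probabilities and `truth` is a 0/1 vector.
threshold_metrics = function(probs, truth, threshold) {
  pred = ifelse(probs > threshold, 1, 0)
  tp = sum(pred == 1 & truth == 1)
  fp = sum(pred == 1 & truth == 0)
  tn = sum(pred == 0 & truth == 0)
  fn = sum(pred == 0 & truth == 1)
  c(accuracy    = (tp + tn) / length(truth),
    typeIerror  = fp / (fp + tn),   # false positive rate
    sensitivity = tp / (tp + fn))   # true positive rate
}
```

For example, threshold_metrics(logistic_predictions, test_set$IsWinner, 0.2) would reproduce the accuracy, type I error, and sensitivity computed by hand below.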

Threshold = 0.2
# Threshold = 0.2

logistic_predictions_02 = ifelse(logistic_predictions > 0.2, 1, 0)
logistic_accuracy_02 = sum(logistic_predictions_02 == test_set[2]) / dim(test_set[2])[1]

table(test_set$IsWinner, logistic_predictions_02)
##    logistic_predictions_02
##      0  1
##   0 94 56
##   1 11 13
false_positive_logistic_02 = table(test_set$IsWinner, logistic_predictions_02)[3]
negative_logistic_02 = table(test_set$IsWinner, logistic_predictions_02)[1] + table(test_set$IsWinner, logistic_predictions_02)[2]
typeIerror_logistic_02 = false_positive_logistic_02 / negative_logistic_02

true_positive_logistic_02 = table(test_set$IsWinner, logistic_predictions_02)[4]
positive_logistic_02 = table(test_set$IsWinner, logistic_predictions_02)[2] + table(test_set$IsWinner, logistic_predictions_02)[4]
sensitivity_logistic_02 = true_positive_logistic_02 / positive_logistic_02

At a threshold of 0.2 the model flags 69 of the 174 test songs as winners: it recovers 13 of the 24 actual winners (sensitivity ≈ 0.54) but misclassifies 56 of the 150 non-winners (type I error ≈ 0.37), for an overall accuracy of about 0.61.

Threshold = 0.3
# Threshold = 0.3

logistic_predictions_03 = ifelse(logistic_predictions > 0.3, 1, 0)
logistic_accuracy_03 = sum(logistic_predictions_03 == test_set[2]) / dim(test_set[2])[1]

table(test_set$IsWinner, logistic_predictions_03)
##    logistic_predictions_03
##       0   1
##   0 140  10
##   1  22   2
false_positive_logistic_03 = table(test_set$IsWinner, logistic_predictions_03)[3]
negative_logistic_03 = table(test_set$IsWinner, logistic_predictions_03)[1] + table(test_set$IsWinner, logistic_predictions_03)[2]
typeIerror_logistic_03 = false_positive_logistic_03 / negative_logistic_03

true_positive_logistic_03 = table(test_set$IsWinner, logistic_predictions_03)[4]
positive_logistic_03 = table(test_set$IsWinner, logistic_predictions_03)[2] + table(test_set$IsWinner, logistic_predictions_03)[4]
sensitivity_logistic_03 = true_positive_logistic_03 / positive_logistic_03

Raising the threshold to 0.3 sharply cuts the false positives (10 of 150 non-winners, type I error ≈ 0.07), but the model now recovers only 2 of the 24 winners (sensitivity ≈ 0.08). Accuracy rises to about 0.82, though largely because non-winners dominate the test set.

Threshold = 0.4
# Threshold = 0.4

logistic_predictions_04 = ifelse(logistic_predictions > 0.4, 1, 0)
logistic_accuracy_04 = sum(logistic_predictions_04 == test_set[2]) / dim(test_set[2])[1]

table(test_set$IsWinner, logistic_predictions_04)
##    logistic_predictions_04
##       0   1
##   0 148   2
##   1  23   1
false_positive_logistic_04 = table(test_set$IsWinner, logistic_predictions_04)[3]
negative_logistic_04 = table(test_set$IsWinner, logistic_predictions_04)[1] + table(test_set$IsWinner, logistic_predictions_04)[2]
typeIerror_logistic_04 = false_positive_logistic_04 / negative_logistic_04

true_positive_logistic_04 = table(test_set$IsWinner, logistic_predictions_04)[4]
positive_logistic_04 = table(test_set$IsWinner, logistic_predictions_04)[2] + table(test_set$IsWinner, logistic_predictions_04)[4]
sensitivity_logistic_04 = true_positive_logistic_04 / positive_logistic_04

At 0.4 the model almost never predicts a win: only 3 songs are flagged, 1 of them a true winner (sensitivity ≈ 0.04, type I error ≈ 0.01). Accuracy peaks at about 0.86, but only because predicting "no win" nearly everywhere matches the imbalanced test set; raising the threshold further would push sensitivity toward zero, so we consider no higher cutoffs.
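To see the trade-off at a glance, the three confusion matrices printed above can be collapsed into one table (counts transcribed from the output; a presentational sketch, not new analysis):

```r
# Confusion-matrix counts from the tables above, one entry per threshold
cm = list(
  "0.2" = c(tn = 94,  fp = 56, fn = 11, tp = 13),
  "0.3" = c(tn = 140, fp = 10, fn = 22, tp = 2),
  "0.4" = c(tn = 148, fp = 2,  fn = 23, tp = 1)
)

metrics = function(x) c(
  accuracy    = (x[["tp"]] + x[["tn"]]) / sum(x),
  typeIerror  = x[["fp"]] / (x[["fp"]] + x[["tn"]]),
  sensitivity = x[["tp"]] / (x[["tp"]] + x[["fn"]])
)

round(t(sapply(cm, metrics)), 3)
```

Accuracy climbs from roughly 0.61 to 0.86 as the threshold rises, while sensitivity collapses from about 0.54 to 0.04 — the imbalance problem the oversampled model below is meant to address.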

ROC Curve

The ROC curve traces this sensitivity/false-positive trade-off across all possible thresholds at once:

# ROC curve

roc.out <- roc(test_set$IsWinner, logistic_predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.5881

With an AUC of 0.588, the model does only modestly better than random guessing (AUC = 0.5): at every threshold, gains in true positives come at a nearly proportional cost in false positives.

Logistic Oversampled Model

With only 24 winners among 174 test songs (and a similar imbalance in the 693-song training set), the logistic model is rewarded for predicting "no win" almost everywhere. Oversampling rebalances the training data by duplicating winner observations until the classes are roughly even, which forces the model to learn the winners' profile; the costs are a higher false-positive rate and a risk of overfitting the duplicated winners.
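The oversampled_train_data used below is built earlier in the paper; a self-contained toy run of the ROSE package's ovun.sample() (synthetic data, illustrative only) shows the mechanism:

```r
library(ROSE)

set.seed(1)
# Toy class-imbalanced data: 90 non-winners, 10 winners
toy = data.frame(IsWinner = factor(c(rep(0, 90), rep(1, 10))),
                 valence  = runif(100))

# method = "over" resamples minority-class rows until the classes are
# roughly balanced (p = 0.5 by default)
balanced = ovun.sample(IsWinner ~ ., data = toy, method = "over", seed = 1)$data
table(balanced$IsWinner)
```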

# Fitting logistic oversampled 

logistic_over = glm(as.numeric(unlist(oversampled_train_data[1])) ~ ., data = oversampled_train_data[-1], family = "binomial")
summary(logistic_over)
## 
## Call:
## glm(formula = as.numeric(unlist(oversampled_train_data[1])) ~ 
##     ., family = "binomial", data = oversampled_train_data[-1])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8239  -1.0573  -0.2236   1.0233   1.9601  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.206e+01  9.351e+02   0.013   0.9897    
## Year             -1.632e-02  7.655e-03  -2.132   0.0330 *  
## followers         2.096e-08  1.071e-08   1.957   0.0503 .  
## acousticness     -2.965e+00  5.074e-01  -5.843 5.12e-09 ***
## danceability     -2.061e-01  6.485e-01  -0.318   0.7506    
## duration_ms       7.219e-06  1.061e-06   6.805 1.01e-11 ***
## energy            4.785e-01  6.896e-01   0.694   0.4878    
## instrumentalness -9.629e-01  5.174e-01  -1.861   0.0627 .  
## key1              1.855e-01  3.309e-01   0.561   0.5750    
## key2              8.622e-02  2.687e-01   0.321   0.7483    
## key3             -6.261e-01  5.038e-01  -1.243   0.2140    
## key4             -4.333e-02  3.239e-01  -0.134   0.8936    
## key5              6.033e-01  3.076e-01   1.961   0.0499 *  
## key6              2.045e-01  3.505e-01   0.583   0.5597    
## key7             -4.914e-01  2.862e-01  -1.717   0.0859 .  
## key8              2.044e-01  3.882e-01   0.526   0.5986    
## key9              8.351e-02  2.701e-01   0.309   0.7572    
## key10            -6.139e-01  3.853e-01  -1.593   0.1111    
## key11             5.927e-01  3.037e-01   1.952   0.0510 .  
## liveness         -3.785e-01  3.950e-01  -0.958   0.3379    
## loudness         -1.144e-01  4.579e-02  -2.498   0.0125 *  
## mode1             9.710e-02  1.598e-01   0.608   0.5433    
## tempo             2.018e-03  2.655e-03   0.760   0.4473    
## time_signature3   1.635e+01  9.350e+02   0.017   0.9861    
## time_signature4   1.699e+01  9.350e+02   0.018   0.9855    
## time_signature5   2.024e+00  1.095e+03   0.002   0.9985    
## valence           1.633e+00  4.025e-01   4.056 4.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1525.7  on 1100  degrees of freedom
## Residual deviance: 1319.2  on 1074  degrees of freedom
## AIC: 1373.2
## 
## Number of Fisher Scoring iterations: 14

The oversampled fit sharpens the picture: acousticness (negative) and duration_ms and valence (positive) are now highly significant, Year and loudness are significant at the 5% level, and followers, instrumentalness, and a few key levels are borderline; the time_signature dummies still show the inflated standard errors seen before. Next we rerun the stepwise selection on this model.

Stepwise Selection on the Oversampled Model

As before, we apply AIC-based stepwise selection, now starting from the oversampled fit:

log_over_back = stepAIC(logistic_over, direction = "backward")
## Start:  AIC=1373.16
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + danceability + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + mode + tempo + time_signature + 
##     valence
## 
##                    Df Deviance    AIC
## - danceability      1   1319.3 1371.3
## - mode              1   1319.5 1371.5
## - energy            1   1319.6 1371.6
## - tempo             1   1319.7 1371.7
## - liveness          1   1320.1 1372.1
## <none>                  1319.2 1373.2
## - instrumentalness  1   1322.7 1374.7
## - followers         1   1323.1 1375.1
## - Year              1   1323.7 1375.7
## - loudness          1   1325.5 1377.5
## - key              11   1345.9 1377.9
## - time_signature    3   1332.5 1380.5
## - valence           1   1335.9 1387.9
## - acousticness      1   1358.3 1410.3
## - duration_ms       1   1373.8 1425.8
## 
## Step:  AIC=1371.26
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + mode + tempo + time_signature + 
##     valence
## 
##                    Df Deviance    AIC
## - mode              1   1319.7 1369.7
## - energy            1   1319.8 1369.8
## - liveness          1   1320.1 1370.1
## - tempo             1   1320.2 1370.2
## <none>                  1319.3 1371.3
## - instrumentalness  1   1322.8 1372.8
## - followers         1   1323.2 1373.2
## - Year              1   1324.0 1374.0
## - loudness          1   1325.5 1375.5
## - key              11   1345.9 1375.9
## - time_signature    3   1332.6 1378.6
## - valence           1   1339.0 1389.0
## - acousticness      1   1358.5 1408.5
## - duration_ms       1   1375.3 1425.3
## 
## Step:  AIC=1369.66
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - energy            1   1320.2 1368.2
## - liveness          1   1320.6 1368.6
## - tempo             1   1320.7 1368.7
## <none>                  1319.7 1369.7
## - instrumentalness  1   1323.3 1371.3
## - followers         1   1323.4 1371.4
## - Year              1   1324.7 1372.7
## - loudness          1   1325.8 1373.8
## - key              11   1346.1 1374.1
## - time_signature    3   1333.0 1377.0
## - valence           1   1339.3 1387.3
## - acousticness      1   1358.6 1406.6
## - duration_ms       1   1375.3 1423.3
## 
## Step:  AIC=1368.18
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + liveness + 
##     loudness + tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - liveness          1   1320.9 1366.9
## - tempo             1   1321.3 1367.3
## <none>                  1320.2 1368.2
## - instrumentalness  1   1323.5 1369.5
## - followers         1   1323.7 1369.7
## - Year              1   1325.5 1371.5
## - key              11   1346.4 1372.4
## - loudness          1   1327.1 1373.1
## - time_signature    3   1333.9 1375.9
## - valence           1   1343.0 1389.0
## - acousticness      1   1369.5 1415.5
## - duration_ms       1   1378.3 1424.3
## 
## Step:  AIC=1366.86
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + loudness + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - tempo             1   1321.9 1365.9
## <none>                  1320.9 1366.9
## - instrumentalness  1   1324.1 1368.1
## - followers         1   1324.4 1368.4
## - Year              1   1326.0 1370.0
## - key              11   1346.8 1370.8
## - loudness          1   1327.7 1371.7
## - time_signature    3   1334.5 1374.5
## - valence           1   1343.7 1387.7
## - acousticness      1   1370.1 1414.1
## - duration_ms       1   1379.3 1423.3
## 
## Step:  AIC=1365.87
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + loudness + 
##     time_signature + valence
## 
##                    Df Deviance    AIC
## <none>                  1321.9 1365.9
## - instrumentalness  1   1324.8 1366.8
## - followers         1   1325.2 1367.2
## - Year              1   1326.7 1368.7
## - loudness          1   1328.3 1370.3
## - key              11   1349.4 1371.4
## - time_signature    3   1335.4 1373.4
## - valence           1   1344.6 1386.6
## - acousticness      1   1371.7 1413.7
## - duration_ms       1   1380.5 1422.5
log_over_for = stepAIC(logistic_over, direction = "forward")
## Start:  AIC=1373.16
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + danceability + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + mode + tempo + time_signature + 
##     valence
log_over_both =  stepAIC(logistic_over, direction = "both")
## Start:  AIC=1373.16
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + danceability + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + mode + tempo + time_signature + 
##     valence
## 
##                    Df Deviance    AIC
## - danceability      1   1319.3 1371.3
## - mode              1   1319.5 1371.5
## - energy            1   1319.6 1371.6
## - tempo             1   1319.7 1371.7
## - liveness          1   1320.1 1372.1
## <none>                  1319.2 1373.2
## - instrumentalness  1   1322.7 1374.7
## - followers         1   1323.1 1375.1
## - Year              1   1323.7 1375.7
## - loudness          1   1325.5 1377.5
## - key              11   1345.9 1377.9
## - time_signature    3   1332.5 1380.5
## - valence           1   1335.9 1387.9
## - acousticness      1   1358.3 1410.3
## - duration_ms       1   1373.8 1425.8
## 
## Step:  AIC=1371.26
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + mode + tempo + time_signature + 
##     valence
## 
##                    Df Deviance    AIC
## - mode              1   1319.7 1369.7
## - energy            1   1319.8 1369.8
## - liveness          1   1320.1 1370.1
## - tempo             1   1320.2 1370.2
## <none>                  1319.3 1371.3
## - instrumentalness  1   1322.8 1372.8
## + danceability      1   1319.2 1373.2
## - followers         1   1323.2 1373.2
## - Year              1   1324.0 1374.0
## - loudness          1   1325.5 1375.5
## - key              11   1345.9 1375.9
## - time_signature    3   1332.6 1378.6
## - valence           1   1339.0 1389.0
## - acousticness      1   1358.5 1408.5
## - duration_ms       1   1375.3 1425.3
## 
## Step:  AIC=1369.66
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + energy + instrumentalness + 
##     key + liveness + loudness + tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - energy            1   1320.2 1368.2
## - liveness          1   1320.6 1368.6
## - tempo             1   1320.7 1368.7
## <none>                  1319.7 1369.7
## + mode              1   1319.3 1371.3
## - instrumentalness  1   1323.3 1371.3
## - followers         1   1323.4 1371.4
## + danceability      1   1319.5 1371.5
## - Year              1   1324.7 1372.7
## - loudness          1   1325.8 1373.8
## - key              11   1346.1 1374.1
## - time_signature    3   1333.0 1377.0
## - valence           1   1339.3 1387.3
## - acousticness      1   1358.6 1406.6
## - duration_ms       1   1375.3 1423.3
## 
## Step:  AIC=1368.18
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + liveness + 
##     loudness + tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - liveness          1   1320.9 1366.9
## - tempo             1   1321.3 1367.3
## <none>                  1320.2 1368.2
## - instrumentalness  1   1323.5 1369.5
## + energy            1   1319.7 1369.7
## - followers         1   1323.7 1369.7
## + mode              1   1319.8 1369.8
## + danceability      1   1320.0 1370.0
## - Year              1   1325.5 1371.5
## - key              11   1346.4 1372.4
## - loudness          1   1327.1 1373.1
## - time_signature    3   1333.9 1375.9
## - valence           1   1343.0 1389.0
## - acousticness      1   1369.5 1415.5
## - duration_ms       1   1378.3 1424.3
## 
## Step:  AIC=1366.86
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + loudness + 
##     tempo + time_signature + valence
## 
##                    Df Deviance    AIC
## - tempo             1   1321.9 1365.9
## <none>                  1320.9 1366.9
## - instrumentalness  1   1324.1 1368.1
## + liveness          1   1320.2 1368.2
## - followers         1   1324.4 1368.4
## + mode              1   1320.4 1368.4
## + energy            1   1320.6 1368.6
## + danceability      1   1320.8 1368.8
## - Year              1   1326.0 1370.0
## - key              11   1346.8 1370.8
## - loudness          1   1327.7 1371.7
## - time_signature    3   1334.5 1374.5
## - valence           1   1343.7 1387.7
## - acousticness      1   1370.1 1414.1
## - duration_ms       1   1379.3 1423.3
## 
## Step:  AIC=1365.87
## as.numeric(unlist(oversampled_train_data[1])) ~ Year + followers + 
##     acousticness + duration_ms + instrumentalness + key + loudness + 
##     time_signature + valence
## 
##                    Df Deviance    AIC
## <none>                  1321.9 1365.9
## - instrumentalness  1   1324.8 1366.8
## + tempo             1   1320.9 1366.9
## - followers         1   1325.2 1367.2
## + mode              1   1321.3 1367.3
## + liveness          1   1321.3 1367.3
## + energy            1   1321.4 1367.4
## + danceability      1   1321.4 1367.4
## - Year              1   1326.7 1368.7
## - loudness          1   1328.3 1370.3
## - key              11   1349.4 1371.4
## - time_signature    3   1335.4 1373.4
## - valence           1   1344.6 1386.6
## - acousticness      1   1371.7 1413.7
## - duration_ms       1   1380.5 1422.5
response_variable_over = as.numeric(unlist(oversampled_train_data[1]))

# Predictors retained by the stepwise selection
reduced_variables_over = oversampled_train_data[, c(2, 3, 4, 6, 8, 9, 11, 14, 15)]

# Build the reduced training set as a data frame so the categorical predictors
# (key, time_signature) keep their factor coding instead of being silently
# coerced to integer codes inside a numeric matrix
reduced_train_data_over = data.frame(
  IsWinner = response_variable_over,
  Year = as.numeric(reduced_variables_over[, 1]),
  followers = as.numeric(reduced_variables_over[, 2]),
  acousticness = as.numeric(reduced_variables_over[, 3]),
  duration_ms = as.numeric(reduced_variables_over[, 4]),
  instrumentalness = as.numeric(reduced_variables_over[, 5]),
  key = as.factor(reduced_variables_over[, 6]),
  loudness = as.numeric(reduced_variables_over[, 7]),
  time_signature = as.factor(reduced_variables_over[, 8]),
  valence = as.numeric(reduced_variables_over[, 9])
)


logistic_reduced_over = glm(response_variable_over ~ Year + followers + acousticness +
                            duration_ms + instrumentalness + key + loudness +
                            time_signature + valence, data = oversampled_train_data,
                            family = "binomial")

logistic_over_predictions = predict(logistic_reduced_over, newdata = test_set[,c(-1, -2)], type = "response")

Discuss the above code.

We next test three classification thresholds on the predicted probabilities: 0.2, 0.3, and 0.4.
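Before comparing them on the real predictions, a minimal standalone sketch (with made-up probabilities) of how a decision threshold turns predicted probabilities into 0/1 labels:

```r
# Sketch: the same probabilities, labelled under two different thresholds
probs <- c(0.15, 0.25, 0.35, 0.45)

ifelse(probs > 0.2, 1, 0)   # lower threshold: more songs labelled winners
ifelse(probs > 0.4, 1, 0)   # higher threshold: fewer songs labelled winners
```

Lowering the threshold trades type I errors for sensitivity, which is exactly the trade-off examined below.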

Threshold = 0.2
# Threshold = 0.2

logistic_over_predictions_02 = ifelse(logistic_over_predictions > 0.2, 1, 0)
logistic_over_accuracy_02 = sum(logistic_over_predictions_02 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, logistic_over_predictions_02)
##    logistic_over_predictions_02
##       0   1
##   0  23 127
##   1   2  22
cm_logistic_over_02 = table(test_set$IsWinner, logistic_over_predictions_02)

false_positive_logistic_over_02 = cm_logistic_over_02[1, 2]
negative_logistic_over_02 = cm_logistic_over_02[1, 1] + cm_logistic_over_02[1, 2]  # actual negatives
typeIerror_logistic_over_02 = false_positive_logistic_over_02 / negative_logistic_over_02

true_positive_logistic_over_02 = cm_logistic_over_02[2, 2]
positive_logistic_over_02 = cm_logistic_over_02[2, 1] + cm_logistic_over_02[2, 2]  # actual winners
sensitivity_logistic_over_02 = true_positive_logistic_over_02 / positive_logistic_over_02

Discuss this threshold.
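As a standalone check of these definitions, the two metrics can be computed directly from the counts printed in the confusion matrix above (rows are the actual class, columns the predicted class):

```r
# Worked check on the threshold-0.2 confusion matrix above
cm <- matrix(c(23, 2, 127, 22), nrow = 2,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))

typeI_error <- cm[1, 2] / sum(cm[1, ])   # 127 / 150: false positive rate
sensitivity <- cm[2, 2] / sum(cm[2, ])   # 22 / 24: true positive rate
```

At this threshold the model flags most songs as winners, so sensitivity is high (about 0.92) at the cost of a very high type I error (about 0.85).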

Threshold = 0.3
logistic_over_predictions_03 = ifelse(logistic_over_predictions > 0.3, 1, 0)
logistic_over_accuracy_03 = sum(logistic_over_predictions_03 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, logistic_over_predictions_03)
##    logistic_over_predictions_03
##       0   1
##   0  47 103
##   1   5  19
cm_logistic_over_03 = table(test_set$IsWinner, logistic_over_predictions_03)

false_positive_logistic_over_03 = cm_logistic_over_03[1, 2]
negative_logistic_over_03 = cm_logistic_over_03[1, 1] + cm_logistic_over_03[1, 2]  # actual negatives
typeIerror_logistic_over_03 = false_positive_logistic_over_03 / negative_logistic_over_03

true_positive_logistic_over_03 = cm_logistic_over_03[2, 2]
positive_logistic_over_03 = cm_logistic_over_03[2, 1] + cm_logistic_over_03[2, 2]  # actual winners
sensitivity_logistic_over_03 = true_positive_logistic_over_03 / positive_logistic_over_03

Discuss what this threshold shows, compared to 0.2

Threshold = 0.4
logistic_over_predictions_04 = ifelse(logistic_over_predictions > 0.4, 1, 0)
logistic_over_accuracy_04 = sum(logistic_over_predictions_04 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, logistic_over_predictions_04)
##    logistic_over_predictions_04
##      0  1
##   0 74 76
##   1  6 18
cm_logistic_over_04 = table(test_set$IsWinner, logistic_over_predictions_04)

false_positive_logistic_over_04 = cm_logistic_over_04[1, 2]
negative_logistic_over_04 = cm_logistic_over_04[1, 1] + cm_logistic_over_04[1, 2]  # actual negatives
typeIerror_logistic_over_04 = false_positive_logistic_over_04 / negative_logistic_over_04

true_positive_logistic_over_04 = cm_logistic_over_04[2, 2]
positive_logistic_over_04 = cm_logistic_over_04[2, 1] + cm_logistic_over_04[2, 2]  # actual winners
sensitivity_logistic_over_04 = true_positive_logistic_over_04 / positive_logistic_over_04

Final discussion of the thresholds and how they compare:

ROC Curve

ROC Curve of this model:

# ROC curve

roc.out <- roc(test_set$IsWinner, logistic_over_predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.6047

Discuss this ROC curve:
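To anchor the interpretation, a small standalone sketch on synthetic data of what AUC values mean: scores unrelated to the labels give an AUC near 0.5, while informative scores push it toward 1, so the 0.60 above sits only modestly above chance.

```r
# Sketch (synthetic data): AUC of uninformative vs informative scores
library(pROC)

set.seed(5)
labels <- rep(c(0, 1), each = 100)
noise_scores <- rnorm(200)                           # unrelated to the labels
informative_scores <- labels + rnorm(200, sd = 0.5)  # related to the labels

auc(roc(labels, noise_scores))        # close to 0.5
auc(roc(labels, informative_scores))  # close to 1
```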

Linear Discriminant Analysis

Discuss this approach and why we are doing it:

We examine the normality of each predictor, conditional on the two classes of the variable IsWinner, using the Shapiro-Wilk test. The Shapiro-Wilk test evaluates the null hypothesis that a sample was drawn from a normal distribution, so a small p-value indicates a significant departure from normality.
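A minimal standalone sketch (on synthetic data) of how the test behaves: it keeps a large p-value for Gaussian draws and firmly rejects normality for strongly skewed draws, which is the pattern we look for below.

```r
# Sketch (synthetic data): Shapiro-Wilk on normal vs skewed samples
set.seed(42)
normal_sample <- rnorm(200)
skewed_sample <- rexp(200)

shapiro.test(normal_sample)$p.value   # large: no evidence against normality
shapiro.test(skewed_sample)$p.value   # tiny: normality rejected
```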

shapiro.test(danceability[IsWinner == 0]) # Yes
## 
##  Shapiro-Wilk normality test
## 
## data:  danceability[IsWinner == 0]
## W = 0.99599, p-value = 0.1629
shapiro.test(danceability[IsWinner == 1]) # Yes
## 
##  Shapiro-Wilk normality test
## 
## data:  danceability[IsWinner == 1]
## W = 0.98983, p-value = 0.4586
shapiro.test(followers[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  followers[IsWinner == 0]
## W = 0.56164, p-value < 2.2e-16
shapiro.test(followers[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  followers[IsWinner == 1]
## W = 0.66252, p-value = 7.101e-16
shapiro.test(acousticness[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  acousticness[IsWinner == 0]
## W = 0.70527, p-value < 2.2e-16
shapiro.test(acousticness[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  acousticness[IsWinner == 1]
## W = 0.56506, p-value < 2.2e-16
shapiro.test(duration_ms[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  duration_ms[IsWinner == 0]
## W = 0.70806, p-value < 2.2e-16
shapiro.test(duration_ms[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  duration_ms[IsWinner == 1]
## W = 0.84049, p-value = 1.513e-10
shapiro.test(energy[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  energy[IsWinner == 0]
## W = 0.93579, p-value = 7.7e-15
shapiro.test(energy[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  energy[IsWinner == 1]
## W = 0.91349, p-value = 4.31e-07
shapiro.test(instrumentalness[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  instrumentalness[IsWinner == 0]
## W = 0.42748, p-value < 2.2e-16
shapiro.test(instrumentalness[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  instrumentalness[IsWinner == 1]
## W = 0.48341, p-value < 2.2e-16
shapiro.test(liveness[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  liveness[IsWinner == 0]
## W = 0.68008, p-value < 2.2e-16
shapiro.test(liveness[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  liveness[IsWinner == 1]
## W = 0.71399, p-value = 1.356e-14
shapiro.test(loudness[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  loudness[IsWinner == 0]
## W = 0.94738, p-value = 2.941e-13
shapiro.test(loudness[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  loudness[IsWinner == 1]
## W = 0.90541, p-value = 1.513e-07
shapiro.test(tempo[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  tempo[IsWinner == 0]
## W = 0.98439, p-value = 9.905e-06
shapiro.test(tempo[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  tempo[IsWinner == 1]
## W = 0.95867, p-value = 0.0005596
shapiro.test(valence[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  valence[IsWinner == 0]
## W = 0.97381, p-value = 1.724e-08
shapiro.test(valence[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  valence[IsWinner == 1]
## W = 0.97234, p-value = 0.009262

According to the Shapiro test, discuss what the values mean for each of the attributes. What useful knowledge is gained from this?

Normal Q-Q Plots of the Attributes

Danceability
par(mfrow = c(1, 2))

# danceability looking normal

qqnorm(danceability[IsWinner == 0])
grid()               
qqline(danceability[IsWinner == 0],lwd = 2, col = "red")

qqnorm(danceability[IsWinner == 1])
grid()               
qqline(danceability[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the danceability feature?

Followers
# followers huge right tail 

qqnorm(followers[IsWinner == 0])
grid()               
qqline(followers[IsWinner == 0],lwd = 2, col = "red")

qqnorm(followers[IsWinner == 1])
grid()               
qqline(followers[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the followers feature?

Acousticness
# acousticness S shaped

qqnorm(acousticness[IsWinner == 0])
grid()               
qqline(acousticness[IsWinner == 0],lwd = 2, col = "red")

qqnorm(acousticness[IsWinner == 1])
grid()               
qqline(acousticness[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the acousticness feature?

Duration
# duration_ms  big right tail

qqnorm(duration_ms[IsWinner == 0])
grid()               
qqline(duration_ms[IsWinner == 0],lwd = 2, col = "red")

qqnorm(duration_ms[IsWinner == 1])
grid()               
qqline(duration_ms[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the duration feature?

Energy
# energy tails not normal

qqnorm(energy[IsWinner == 0])
grid()               
qqline(energy[IsWinner == 0],lwd = 2, col = "red")

qqnorm(energy[IsWinner == 1])
grid()               
qqline(energy[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the energy feature?

Instrumentalness
# instrumentalness huge right tail

qqnorm(instrumentalness[IsWinner == 0])
grid()               
qqline(instrumentalness[IsWinner == 0],lwd = 2, col = "red")

qqnorm(instrumentalness[IsWinner == 1])
grid()               
qqline(instrumentalness[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the instrumentalness feature?

Liveness
# liveness S shaped

qqnorm(liveness[IsWinner == 0])
grid()               
qqline(liveness[IsWinner == 0],lwd = 2, col = "red")

qqnorm(liveness[IsWinner == 1])
grid()               
qqline(liveness[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the liveness feature?

Loudness
# loudness tails not normal

qqnorm(loudness[IsWinner == 0])
grid()               
qqline(loudness[IsWinner == 0],lwd = 2, col = "red")

qqnorm(loudness[IsWinner == 1])
grid()               
qqline(loudness[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the loudness feature?

Tempo
# tempo tails slightly not normal

qqnorm(tempo[IsWinner == 0])
grid()               
qqline(tempo[IsWinner == 0],lwd = 2, col = "red")

qqnorm(tempo[IsWinner == 1])
grid()               
qqline(tempo[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the tempo feature?

Valence
# valence tails slightly not normal

qqnorm(valence[IsWinner == 0])
grid()               
qqline(valence[IsWinner == 0],lwd = 2, col = "red")

qqnorm(valence[IsWinner == 1])
grid()               
qqline(valence[IsWinner == 1],lwd = 2, col = "red")

What does this tell us about the valence feature?

Overall discussion about the above plots. Highlight strange shapes, important findings, etc. How do we use what we learned?

Transformations

We apply Box-Cox transformations to the predictors in an attempt to bring them closer to normality, then re-examine each transformed predictor with normal Q-Q plots and the Shapiro-Wilk test.
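A standalone sketch (on synthetic right-skewed data) of the Box-Cox recipe used in each block below: pick the lambda maximising the profile log-likelihood, then apply the corresponding power transform.

```r
# Sketch (synthetic data) of the Box-Cox recipe applied below
library(MASS)

set.seed(1)
x <- rexp(300)                       # strictly positive, right-skewed

b <- boxcox(lm(x ~ 1), plotit = FALSE)
lambda <- b$x[which.max(b$y)]        # lambda with the highest log-likelihood
x_tran <- (x ^ lambda - 1) / lambda  # transformed values are far less skewed
```

Note that Box-Cox requires a strictly positive variable, which is why instrumentalness (zeros) and loudness (negative) need special handling below.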

Followers
# followers plots improved, one passes the test

par(mfrow = c(1, 1))

b_followers <- boxcox(lm(followers ~ 1))

lambda <- b_followers$x[which.max(b_followers$y)]
followers_tran <- (followers ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(followers_tran[IsWinner == 0])
grid()               
qqline(followers_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(followers_tran[IsWinner == 1])
grid()               
qqline(followers_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(followers_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  followers_tran[IsWinner == 0]
## W = 0.9908, p-value = 0.001406
shapiro.test(followers_tran[IsWinner == 1]) # Yes
## 
##  Shapiro-Wilk normality test
## 
## data:  followers_tran[IsWinner == 1]
## W = 0.99009, p-value = 0.4815

Discuss what these tests say about the newly transformed feature

Acousticness
# Acousticness plots improved, test not passed

par(mfrow = c(1, 1))

b_acousticness <- boxcox(lm(acousticness ~ 1))

lambda <- b_acousticness$x[which.max(b_acousticness$y)]
acousticness_tran <- (acousticness ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(acousticness_tran[IsWinner == 0])
grid()               
qqline(acousticness_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(acousticness_tran[IsWinner == 1])
grid()               
qqline(acousticness_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(acousticness_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  acousticness_tran[IsWinner == 0]
## W = 0.95483, p-value = 4.126e-12
shapiro.test(acousticness_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  acousticness_tran[IsWinner == 1]
## W = 0.96627, p-value = 0.002542

Discuss what these tests say about the newly transformed feature

Duration
# duration_ms plots improved, test not passed

par(mfrow = c(1, 1))

b_duration_ms <- boxcox(lm(duration_ms ~ 1))

lambda <- b_duration_ms$x[which.max(b_duration_ms$y)]
duration_ms_tran <- (duration_ms ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(duration_ms_tran[IsWinner == 0])
grid()               
qqline(duration_ms_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(duration_ms_tran[IsWinner == 1])
grid()               
qqline(duration_ms_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(duration_ms_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  duration_ms_tran[IsWinner == 0]
## W = 0.96454, p-value = 2.059e-10
shapiro.test(duration_ms_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  duration_ms_tran[IsWinner == 1]
## W = 0.97209, p-value = 0.008766

Discuss what these tests say about the newly transformed feature

Energy
# energy plots are pretty much the same test not passed

par(mfrow = c(1, 1))

b_energy <- boxcox(lm(energy ~ 1))

lambda <- b_energy$x[which.max(b_energy$y)]
energy_tran <- (energy ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(energy_tran[IsWinner == 0])
grid()               
qqline(energy_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(energy_tran[IsWinner == 1])
grid()               
qqline(energy_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(energy_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  energy_tran[IsWinner == 0]
## W = 0.96009, p-value = 3.178e-11
shapiro.test(energy_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  energy_tran[IsWinner == 1]
## W = 0.96047, p-value = 0.0007932

Discuss what these tests say about the newly transformed feature

Instrumentalness
# instrumentalness can't apply the boxcox transformation because there are some 0's
# Tried to apply a linear transformation before but the plot is weird

par(mfrow = c(1, 1))

# Added a tiny value to overcome the issue of 0's
new_instrumentalness = instrumentalness + 1e-05

b_instrumentalness <- boxcox(lm(new_instrumentalness ~ 1))

lambda <- b_instrumentalness$x[which.max(b_instrumentalness$y)]
instrumentalness_tran <- (new_instrumentalness ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(instrumentalness_tran[IsWinner == 0])
grid()               
qqline(instrumentalness_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(instrumentalness_tran[IsWinner == 1])
grid()               
qqline(instrumentalness_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(instrumentalness_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  instrumentalness_tran[IsWinner == 0]
## W = 0.88817, p-value < 2.2e-16
shapiro.test(instrumentalness_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  instrumentalness_tran[IsWinner == 1]
## W = 0.90309, p-value = 1.132e-07

Discuss what these tests say about the newly transformed feature

Liveness
# liveness plots improved a lot, test not passed because of a few points in the tails

par(mfrow = c(1, 1))

b_liveness <- boxcox(lm(liveness ~ 1))

lambda <- b_liveness$x[which.max(b_liveness$y)]
liveness_tran <- (liveness ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(liveness_tran[IsWinner == 0])
grid()               
qqline(liveness_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(liveness_tran[IsWinner == 1])
grid()               
qqline(liveness_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(liveness_tran[IsWinner == 0]) # No 
## 
##  Shapiro-Wilk normality test
## 
## data:  liveness_tran[IsWinner == 0]
## W = 0.96369, p-value = 1.429e-10
shapiro.test(liveness_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  liveness_tran[IsWinner == 1]
## W = 0.97462, p-value = 0.01533

Discuss what these tests say about the newly transformed feature

Loudness
# loudness boxcox transformation not directly applicable because the variable
# is always negative, I multiplied the values by -1 and then applied it, 
# it gave very good results, but we need to pay attention to the interpretation

par(mfrow = c(1, 1))

new_loudness = loudness * (-1)

b_loudness <- boxcox(lm(new_loudness ~ 1))

lambda <- b_loudness$x[which.max(b_loudness$y)]
loudness_tran <- (new_loudness ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(loudness_tran[IsWinner == 0])
grid()               
qqline(loudness_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(loudness_tran[IsWinner == 1])
grid()               
qqline(loudness_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(loudness_tran[IsWinner == 0]) # Yes
## 
##  Shapiro-Wilk normality test
## 
## data:  loudness_tran[IsWinner == 0]
## W = 0.99829, p-value = 0.8586
shapiro.test(loudness_tran[IsWinner == 1]) # Yes
## 
##  Shapiro-Wilk normality test
## 
## data:  loudness_tran[IsWinner == 1]
## W = 0.99337, p-value = 0.8045

Discuss what these tests say about the newly transformed feature. (NOTE: loudness was multiplied by -1 before the Box-Cox step, so the transformed scale is reversed (RIGHT??))

Tempo
# tempo, slight improvement in the plots

par(mfrow = c(1, 1))

b_tempo <- boxcox(lm(tempo ~ 1))

lambda <- b_tempo$x[which.max(b_tempo$y)]
tempo_tran <- (tempo ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(tempo_tran[IsWinner == 0])
grid()               
qqline(tempo_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(tempo_tran[IsWinner == 1])
grid()               
qqline(tempo_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(tempo_tran[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  tempo_tran[IsWinner == 0]
## W = 0.99334, p-value = 0.01358
shapiro.test(tempo_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  tempo_tran[IsWinner == 1]
## W = 0.96738, p-value = 0.003199

Discuss what these tests say about the newly transformed feature

Valence
# valence, pretty much the same

par(mfrow = c(1, 1))

b_valence <- boxcox(lm(valence ~ 1))

lambda <- b_valence$x[which.max(b_valence$y)]
valence_tran <- (valence ^ lambda - 1) / lambda

par(mfrow = c(1, 2))

qqnorm(valence_tran[IsWinner == 0])
grid()               
qqline(valence_tran[IsWinner == 0],lwd = 2, col = "red")

qqnorm(valence_tran[IsWinner == 1])
grid()               
qqline(valence_tran[IsWinner == 1],lwd = 2, col = "red")

shapiro.test(valence_tran[IsWinner == 0]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  valence_tran[IsWinner == 0]
## W = 0.98604, p-value = 3.22e-05
shapiro.test(valence_tran[IsWinner == 1]) # No
## 
##  Shapiro-Wilk normality test
## 
## data:  valence_tran[IsWinner == 1]
## W = 0.95626, p-value = 0.0003542

Discuss what these tests say about the newly transformed feature

(SHOULD WE COMPARE A FEW OF THE QQ PLOTS BEFORE AND AFTER THE TRANSFORMATION SIDE-BY-SIDE AS A GOOD VISUAL? FOR THE FEATURES THAT CHANGED AND FOR ONE THAT DID NOT?)

Applying the Best Box-Cox Transformations to the Dataset

par(mfrow = c(1, 1))

b_data_followers <- boxcox(lm(data$followers ~ 1))

lambda <- b_data_followers$x[which.max(b_data_followers$y)]
data_followers_tran <- (data$followers ^ lambda - 1) / lambda

b_data_acousticness <- boxcox(lm(data$acousticness ~ 1))

lambda <- b_data_acousticness$x[which.max(b_data_acousticness$y)]
data_acousticness_tran <- (data$acousticness ^ lambda - 1) / lambda

b_data_duration_ms <- boxcox(lm(data$duration_ms ~ 1))

lambda <- b_data_duration_ms$x[which.max(b_data_duration_ms$y)]
data_duration_ms_tran <- (data$duration_ms ^ lambda - 1) / lambda

b_data_liveness <- boxcox(lm(data$liveness ~ 1))

lambda <- b_data_liveness$x[which.max(b_data_liveness$y)]
data_liveness_tran <- (data$liveness ^ lambda - 1) / lambda

neg_loudness = data$loudness * (-1)
b_data_loudness <- boxcox(lm(neg_loudness ~ 1))

lambda <- b_data_loudness$x[which.max(b_data_loudness$y)]
data_loudness_tran <- (neg_loudness ^ lambda - 1) / lambda

b_data_tempo <- boxcox(lm(data$tempo ~ 1))

lambda <- b_data_tempo$x[which.max(b_data_tempo$y)]
data_tempo_tran <- (data$tempo ^ lambda - 1) / lambda


tran_data = matrix(c(data$IsWinner, data_followers_tran, data_acousticness_tran, data_duration_ms_tran, 
            data$energy, data$instrumentalness, data_liveness_tran, data_loudness_tran, 
            data_tempo_tran, data$valence), ncol = 10)

training_tran_data = tran_data[train_ind,]

test_tran_data = tran_data[-train_ind,]

colnames_tran_data = c("IsWinner", "followers_tran", "acousticness_tran", "duration_ms_tran", "energy",
                       "instrumentalness", "liveness_tran", "loudness_tran", "tempo_tran", "valence")

colnames(training_tran_data) = colnames_tran_data
colnames(test_tran_data) = colnames_tran_data
colnames(tran_data) = colnames_tran_data


training_tran_data = as.data.frame(training_tran_data)
test_tran_data = as.data.frame(test_tran_data)
tran_data = as.data.frame(tran_data)

Discuss what these Box-Cox plots show us, and what we take from them for the examination.

LDA on the Transformed Data

## LDA

simple_lda = lda(IsWinner ~ followers_tran + acousticness_tran + duration_ms_tran + 
                energy + instrumentalness + liveness_tran + loudness_tran + tempo_tran +
                valence, data = training_tran_data)


pred_simple_lda = predict(simple_lda, newdata = test_tran_data)

# ROC curve

roc.out <- roc(test_set$IsWinner, as.numeric(pred_simple_lda$class) - 1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.4967

Discuss the findings of the LDA and ROC curve

Oversampled LDA Examination

# Oversampled LDA

oversampled_train_tran_data = ovun.sample(IsWinner ~., data = training_tran_data, method = "over", p = 0.5, seed = 42)$data


simple_over_lda = lda(IsWinner ~ followers_tran + acousticness_tran + duration_ms_tran + 
                   energy + instrumentalness + liveness_tran + loudness_tran + tempo_tran +
                   valence, data = oversampled_train_tran_data)


pred_simple_over_lda = predict(simple_over_lda, newdata = test_tran_data)

# ROC curve

roc.out <- roc(test_set$IsWinner, as.numeric(pred_simple_over_lda$class) - 1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.5767

Discuss the oversampled LDA findings
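A standalone toy sketch of what the oversampling step above does: ROSE::ovun.sample with method = "over" duplicates minority-class rows until the requested class balance (p = 0.5) is reached, leaving the majority class untouched.

```r
# Sketch (toy data): oversampling a 90/10 class imbalance toward 50/50
library(ROSE)

set.seed(42)
toy <- data.frame(IsWinner = rep(c(0, 1), times = c(90, 10)),
                  x = rnorm(100))

balanced <- ovun.sample(IsWinner ~ ., data = toy, method = "over",
                        p = 0.5, seed = 42)$data

table(toy$IsWinner)        # 90 non-winners, 10 winners
table(balanced$IsWinner)   # roughly balanced classes
```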

Predictions

Making predictions on the oversampled LDA

Threshold = 0.4
# Threshold 0.4

lda_over_predictions_04 = ifelse(pred_simple_over_lda$posterior[, 2] > 0.4, 1, 0)  # posterior[, 2] is P(IsWinner = 1)
lda_over_accuracy_04 = sum(lda_over_predictions_04 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, lda_over_predictions_04)
##    lda_over_predictions_04
##       0   1
##   0  25 125
##   1   6  18
cm_lda_over_04 = table(test_set$IsWinner, lda_over_predictions_04)

false_positive_lda_over_04 = cm_lda_over_04[1, 2]
negative_lda_over_04 = cm_lda_over_04[1, 1] + cm_lda_over_04[1, 2]  # actual negatives
typeIerror_lda_over_04 = false_positive_lda_over_04 / negative_lda_over_04

true_positive_lda_over_04 = cm_lda_over_04[2, 2]
positive_lda_over_04 = cm_lda_over_04[2, 1] + cm_lda_over_04[2, 2]  # actual winners
sensitivity_lda_over_04 = true_positive_lda_over_04 / positive_lda_over_04

Discuss what the 0.4 threshold finds

Threshold = 0.5
# Threshold 0.5

lda_over_predictions_05 = pred_simple_over_lda$class  # the default 0.5 cutoff
lda_over_accuracy_05 = sum(lda_over_predictions_05 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, lda_over_predictions_05)
##    
##      0  1
##   0 98 52
##   1 12 12
cm_lda_over_05 = table(test_set$IsWinner, lda_over_predictions_05)

false_positive_lda_over_05 = cm_lda_over_05[1, 2]
negative_lda_over_05 = cm_lda_over_05[1, 1] + cm_lda_over_05[1, 2]  # actual negatives
typeIerror_lda_over_05 = false_positive_lda_over_05 / negative_lda_over_05

true_positive_lda_over_05 = cm_lda_over_05[2, 2]
positive_lda_over_05 = cm_lda_over_05[2, 1] + cm_lda_over_05[2, 2]  # actual winners
sensitivity_lda_over_05 = true_positive_lda_over_05 / positive_lda_over_05

Discuss what the 0.5 prediction threshold finds compared to the 0.4

QDA and Oversampled QDA, with ROC Curves

## QDA

# qda_fit avoids masking the qda() function from MASS
qda_fit = qda(IsWinner ~ followers_tran + acousticness_tran + duration_ms_tran + 
            energy + instrumentalness + liveness_tran + loudness_tran + tempo_tran +
            valence, data = training_tran_data)


pred_qda = predict(qda_fit, newdata = test_tran_data)

# ROC

roc.out <- roc(test_set$IsWinner, as.numeric(pred_qda$class) - 1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.5358
# QDA oversampled

qda_over = qda(IsWinner ~ followers_tran + acousticness_tran + duration_ms_tran + 
            energy + instrumentalness + liveness_tran + loudness_tran + tempo_tran +
            valence, data = oversampled_train_tran_data)


pred_qda_over = predict(qda_over, newdata = test_tran_data)

# ROC

roc.out <- roc(test_set$IsWinner, as.numeric(pred_qda_over$class) - 1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.out, print.auc=TRUE, legacy.axes=TRUE, xlab="False positive rate", ylab="True positive rate")

auc(roc.out)
## Area under the curve: 0.5758

Discuss what these show:

Predictions

Making predictions on the oversampled QDA at two thresholds:

Threshold = 0.4
# Predictions

# Threshold 0.4

qda_over_predictions_04 = ifelse(pred_qda_over$posterior[, 2] > 0.4, 1, 0)  # posterior[, 2] is P(IsWinner = 1)
qda_over_accuracy_04 = sum(qda_over_predictions_04 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, qda_over_predictions_04)
##    qda_over_predictions_04
##      0  1
##   0 53 97
##   1 12 12
cm_qda_over_04 = table(test_set$IsWinner, qda_over_predictions_04)

false_positive_qda_over_04 = cm_qda_over_04[1, 2]
negative_qda_over_04 = cm_qda_over_04[1, 1] + cm_qda_over_04[1, 2]  # actual negatives
typeIerror_qda_over_04 = false_positive_qda_over_04 / negative_qda_over_04

true_positive_qda_over_04 = cm_qda_over_04[2, 2]
positive_qda_over_04 = cm_qda_over_04[2, 1] + cm_qda_over_04[2, 2]  # actual winners
sensitivity_qda_over_04 = true_positive_qda_over_04 / positive_qda_over_04

Discuss what these predictions found

Threshold = 0.5
# Threshold 0.5

qda_over_predictions_05 = pred_qda_over$class  # the default 0.5 cutoff
qda_over_accuracy_05 = sum(qda_over_predictions_05 == test_set$IsWinner) / nrow(test_set)

table(test_set$IsWinner, qda_over_predictions_05)
##    
##      0  1
##   0 79 71
##   1  9 15
cm_qda_over_05 = table(test_set$IsWinner, qda_over_predictions_05)

false_positive_qda_over_05 = cm_qda_over_05[1, 2]
negative_qda_over_05 = cm_qda_over_05[1, 1] + cm_qda_over_05[1, 2]  # actual negatives
typeIerror_qda_over_05 = false_positive_qda_over_05 / negative_qda_over_05

true_positive_qda_over_05 = cm_qda_over_05[2, 2]
positive_qda_over_05 = cm_qda_over_05[2, 1] + cm_qda_over_05[2, 2]  # actual winners
sensitivity_qda_over_05 = true_positive_qda_over_05 / positive_qda_over_05

Discuss what this prediction found at this threshold and compare it to 0.4

Regression

Discuss what regression is, how it differs from what we have looked at so far, and what it may unveil

Ridge Regression

Discuss what specifically this ridge regression is; benefits and drawbacks?

## Ridge regression

 
ridge_cv = cv.glmnet(data.matrix(oversampled_train_data[,-1]), oversampled_train_data$IsWinner,
                      alpha = 0, family = "binomial", type.measure = "class")

plot(ridge_cv)

lambda_opt_ridge <- ridge_cv$lambda.min

pred_ridge = predict(ridge_cv, data.matrix(test_set[, c(-1, -2)]), type = "class", s = lambda_opt_ridge)

table(test_set$IsWinner, pred_ridge)
##    pred_ridge
##      0  1
##   0 96 54
##   1 12 12

Discuss the results: What does it show?

Lasso Regression

What is lasso regression? How does it differ from ordinary and ridge regression? What are its benefits and drawbacks?
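A standalone sketch (on synthetic data) of the key difference: ridge (alpha = 0) shrinks every coefficient toward zero but keeps them all, while lasso (alpha = 1) can set uninformative coefficients exactly to zero, performing variable selection.

```r
# Sketch (synthetic data): ridge shrinks, lasso selects
library(glmnet)

set.seed(7)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- rbinom(200, 1, plogis(2 * X[, 1] - 2 * X[, 2]))  # only 2 informative predictors

ridge_coefs <- as.numeric(coef(glmnet(X, y, alpha = 0, family = "binomial",
                                      lambda = 0.1)))
lasso_coefs <- as.numeric(coef(glmnet(X, y, alpha = 1, family = "binomial",
                                      lambda = 0.1)))

sum(ridge_coefs == 0)   # ridge keeps all coefficients non-zero
sum(lasso_coefs == 0)   # lasso zeroes out several noise predictors
```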

## Lasso regression

lasso_cv = cv.glmnet(data.matrix(oversampled_train_data[,-1]), oversampled_train_data$IsWinner,
                     alpha = 1, family = "binomial", type.measure = "class")

plot(lasso_cv)

lambda_opt_lasso <- lasso_cv$lambda.min

pred_lasso = predict(lasso_cv, data.matrix(test_set[, c(-1, -2)]), type = "class", s = lambda_opt_lasso)

table(test_set$IsWinner, pred_lasso)
##    pred_lasso
##      0  1
##   0 94 56
##   1 12 12

What does this show? What insight does it give? Interpret the results.

KNN

What is KNN? How does it work?
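A toy standalone sketch (synthetic two-cluster data) of k-nearest neighbours with class::knn: each test point takes the majority label among its k closest training points, which is why the features must be normalized to a common scale first.

```r
# Toy sketch: k-NN on two well-separated 2-D clusters
library(class)

set.seed(3)
train_X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # cluster of class 0
                 matrix(rnorm(100, mean = 3), ncol = 2))   # cluster of class 1
train_y <- factor(rep(c(0, 1), each = 50))
test_X  <- rbind(c(0, 0), c(3, 3))    # one point near each cluster centre

knn(train_X, test_X, cl = train_y, k = 5)   # expected to predict class 0, then class 1
```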

# K-NN

min_max_norm = function(x) {
  (x - min(x)) / (max(x) - min(x))
}

normalized_data = as.data.frame(lapply(data[,c(-1, -2, -10, -13, -15)], min_max_norm))

IsWinner_norm = data$IsWinner

normalized_data = cbind(IsWinner_norm, normalized_data)

training_norm_data = normalized_data[train_ind,]

test_norm_data = normalized_data[-train_ind,]

With the normalized training and test sets in place, we tune the model's single hyperparameter, the number of neighbours k. We fit KNN for every k from 1 to 100 and record the test error of each fit:

# Selecting k

kmax = 100

test_error = numeric(kmax)

for (k in 1:kmax) {
  knn_pred = as.factor(knn(training_norm_data[,-1], test_norm_data[,-1],
                            cl = training_norm_data$IsWinner_norm, k = k))
  cm = confusionMatrix(data = knn_pred, reference = as.factor(test_norm_data$IsWinner_norm))
  test_error[k] = 1 - cm$overall[1]
}


k_min = which.min(test_error)

We then refit the model using the k that achieved the lowest test error:

knn = knn(training_norm_data[,-1], test_norm_data[,-1],
          cl = training_norm_data$IsWinner_norm, k = k_min)

knn_pred_min = as.factor(knn)

table(test_norm_data$IsWinner_norm, knn)
##    knn
##       0   1
##   0 150   0
##   1  24   0

At its best k, the model simply predicts "not a winner" for every test song: all 150 non-winners are classified correctly, but all 24 winners are missed (sensitivity 0). The resulting 86% accuracy (150/174) is exactly the majority-class baseline, so the model has learned nothing useful about winners. As with the earlier classifiers, we try to correct this by oversampling the minority class in the training data.
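Transcribing the table above makes the accuracy paradox explicit: overall accuracy equals the rate of the trivial "no song wins" rule, while sensitivity is zero.

```r
# KNN confusion matrix as reported above (rows = truth, cols = prediction)
cm_knn = matrix(c(150, 0,
                   24, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(truth = c("0", "1"), pred = c("0", "1")))

accuracy_knn    = sum(diag(cm_knn)) / sum(cm_knn)        # 150/174, about 0.862
baseline_knn    = sum(cm_knn["0", ]) / sum(cm_knn)       # majority-class rate: identical
sensitivity_knn = cm_knn["1", "1"] / sum(cm_knn["1", ])  # 0/24 = 0
```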

Oversampling

Repeat the above process (Find k, fit model) using oversampled data:

# oversampled

test_over_error = numeric(kmax)

# The oversampled data is already a training set, so it is normalized whole
# rather than subset again by train_ind
normalized_over_data = as.data.frame(lapply(oversampled_train_data[,c(-9, -12, -14)], min_max_norm))

training_norm_data_over = normalized_over_data

for (k in 1:kmax) {
  # Class labels must come from the oversampled data itself, so that they
  # line up row-for-row with the oversampled feature matrix
  knn_over_pred = as.factor(knn(training_norm_data_over[,-1], test_norm_data[,-1],
                                cl = training_norm_data_over[,1], k = k))
  cm_over = confusionMatrix(data = knn_over_pred, reference = as.factor(test_norm_data$IsWinner_norm))
  test_over_error[k] = 1 - cm_over$overall[1]
}


k_min_over = which.min(test_over_error)


knn_over = knn(training_norm_data_over[,-1], test_norm_data[,-1],
               cl = training_norm_data_over[,1], k = k_min_over)

knn_pred_min_over = as.factor(knn_over)

table(test_norm_data$IsWinner_norm, knn_over)
##    knn_over
##       0   1
##   0 150   0
##   1  22   2

Oversampling helps only marginally here: the model now identifies 2 of the 24 winners while still classifying all 150 non-winners correctly. Sensitivity rises from 0 to about 8%, so even with balanced training data KNN remains heavily biased toward the majority class on this problem.

Results and Conclusions

In the end…

SHOW TABLES, GRAPHS, CHARTS, HIGHLIGHT THE MOST IMPORTANT INFORMATION ONLY. DO NOT INUNDATE THIS SECTION.

Future work could expand upon our research by…